Managing access of multiple executing programs to non-local block data storage

ABSTRACT

Techniques are described for managing access of executing programs to non-local block data storage. In some situations, a block data storage service uses multiple server storage systems to reliably store network-accessible block data storage volumes that may be used by programs executing on other physical computing systems. A group of multiple server block data storage systems that store block data volumes may in some situations be co-located at a data center, and programs that use volumes stored there may execute on other physical computing systems at that data center. If a program using a volume becomes unavailable, another program (e.g., another copy of the same program) may in some situations obtain access to and continue to use the same volume, such as in an automatic manner in some such situations.

BACKGROUND

Many companies and other organizations operate computer networks that interconnect numerous computing systems to support their operations, such as with the computing systems being co-located (e.g., as part of a local network) or instead located in multiple distinct geographical locations (e.g., connected via one or more private or public intermediate networks). For example, data centers housing significant numbers of co-located interconnected computing systems have become commonplace, such as private data centers that are operated by and on behalf of a single organization, and public data centers that are operated by entities as businesses. Some public data center operators provide network access, power, and secure installation facilities for hardware owned by various customers, while other public data center operators provide “full service” facilities that also include hardware resources made available for use by their customers. However, as the scale and scope of typical data centers and computer networks has increased, the task of provisioning, administering, and managing the associated physical computing resources has become increasingly complicated.

The advent of virtualization technologies for commodity hardware has provided some benefits with respect to managing large-scale computing resources for many customers with diverse needs, allowing various computing resources to be efficiently and securely shared between multiple customers. For example, virtualization technologies such as those provided by XEN, VMWare, or User-Mode Linux may allow a single physical computing system to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing system, with each such virtual machine being a software simulation acting as a distinct logical computing system that provides users with the illusion that they are the sole operators and administrators of a given hardware computing resource, while also providing application isolation and security among the various virtual machines. Furthermore, some virtualization technologies provide virtual resources that span one or more physical resources, such as a single virtual machine with multiple virtual processors that actually spans multiple distinct physical computing systems.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a network diagram illustrating an example embodiment in which multiple computing systems execute programs and access reliable non-local block data storage.

FIGS. 2A-2F illustrate examples of providing reliable non-local block data storage functionality to clients.

FIG. 3 is a block diagram illustrating example computing systems suitable for managing the provision to and use by clients of reliable non-local block data storage functionality.

FIG. 4 illustrates a flow diagram of an example embodiment of a Block Data Storage System Manager routine.

FIG. 5 illustrates a flow diagram of an example embodiment of a Node Manager routine.

FIG. 6 illustrates a flow diagram of an example embodiment of a Block Data Storage Server routine.

FIGS. 7A-7B illustrate a flow diagram of an example embodiment of a Program Execution Service System Manager routine.

FIG. 8 illustrates a flow diagram of an example embodiment of a Block Data Storage Archival Manager routine.

DETAILED DESCRIPTION

Techniques are described for managing access of executing programs to non-local block data storage. In at least some embodiments, the techniques include providing a block data storage service that uses multiple server storage systems to reliably store block data that may be accessed and used over one or more networks by programs executing on other physical computing systems. Users of the block data storage service may each create one or more block data storage volumes that each have a specified amount of block data storage space, and may initiate use of such a block data storage volume (also referred to as a “volume” herein) by one or more executing programs, with at least some such volumes having copies stored by two or more of the multiple server storage systems so as to enhance volume reliability and availability to the executing programs. As one example, the multiple server block data storage systems that store block data may in some embodiments be organized into one or more pools or other groups that each have multiple physical server storage systems co-located at a geographical location, such as in each of one or more geographically distributed data centers, and the program(s) that use a volume stored on a server block data storage system in a data center may execute on one or more other physical computing systems at that data center. Additional details related to embodiments of a block data storage service are included below, and at least some of the described techniques for providing a block data storage service may be automatically performed by embodiments of a Block Data Storage (“BDS”) System Manager module.

In addition, in at least some embodiments, executing programs that access and use one or more such non-local block data storage volumes over one or more networks may each have an associated node manager that manages the access to those non-local volumes by the program, such as a node manager module that is provided by the block data storage service and/or that operates in conjunction with one or more BDS System Manager modules. For example, a first user who is a customer of the block data storage service may create a first block data storage volume, and execute one or more program copies on one or more computing nodes that are instructed to access and use the first volume (e.g., in a serial manner, in a simultaneous or other overlapping manner, etc.). When a program executing on a computing node initiates use of a non-local volume, the program may mount or otherwise be provided with a logical block data storage device that is local to the computing node and that represents the non-local volume, such as to allow the executing program to interact with the local logical block data storage device in the same manner as any other local hard drive or other physical block data storage device that is attached to the computing node (e.g., to perform read and write data access requests, to implement a file system or database or other higher-level data structure on the volume, etc.). For example, in at least some embodiments, a representative logical local block data storage device may be made available to an executing program via use of GNBD (“Global Network Block Device”) technology. In addition, as discussed in greater detail below, when the executing program interacts with the representative local logical block data storage device, the associated node manager may manage those interactions by communicating over one or more networks with at least one of the server block data storage systems that stores a copy of the associated non-local volume (e.g., in a manner transparent to the executing program and/or computing node) so as to perform the interactions on that stored volume copy on behalf of the executing program. Furthermore, in at least some embodiments, at least some of the described techniques for managing access of executing programs to non-local block data storage volumes are automatically performed by embodiments of a Node Manager module.
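
The following Python sketch is illustrative only: it shows one plausible way a node manager might forward block reads and writes from a local logical device to the server storing the primary volume copy. The class name, the wire framing, and the server-side protocol are all assumptions introduced here for illustration; they are not the GNBD protocol or any interface described in the text.

```python
import socket
import struct

class NodeManagerVolumeProxy:
    """Illustrative sketch: forwards block I/O for a local logical device
    to the server block data storage system holding the primary copy."""

    def __init__(self, volume_id: bytes, primary_server_addr):
        self.volume_id = volume_id              # assumed: 16 raw bytes
        self.sock = socket.create_connection(primary_server_addr)

    def read_block(self, offset: int, length: int) -> bytes:
        # Hypothetical framing: 1-byte opcode, 16-byte volume id, offset, length.
        self.sock.sendall(struct.pack("!B16sQI", 0, self.volume_id, offset, length))
        return self._recv_exact(length)

    def write_block(self, offset: int, data: bytes) -> bool:
        self.sock.sendall(
            struct.pack("!B16sQI", 1, self.volume_id, offset, len(data)) + data)
        return self._recv_exact(1) == b"\x01"   # wait for the server's ack

    def _recv_exact(self, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("server block data storage system unavailable")
            buf += chunk
        return buf
```

The executing program sees only an ordinary block device; the network hop happens inside the node manager, which is what keeps the remote volume transparent to the program.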

In addition, in at least some embodiments, at least some block data storage volumes (or portions of those volumes) may further be stored on one or more remote archival storage systems that are distinct from the server block data storage systems used to store volume copies. In various embodiments, the one or more remote archival storage systems may be provided by the block data storage service (e.g., at a location remote from a data center or other geographical location that has a pool of co-located server block data storage systems), or instead may be provided by a remote long-term storage service and used by the block data storage service, and in at least some embodiments the archival storage system may store data in a format other than block data (e.g., may store one or more chunks or portions of a volume as distinct objects). Such archival storage systems may be used in various manners in various embodiments to provide various benefits, as discussed in greater detail below. In some embodiments in which a remote long-term storage service provides the archival storage systems, users of the block data storage service (e.g., customers of the block data storage service who pay fees to use the block data storage service) who are also users of the remote long-term storage service (e.g., customers of the remote long-term storage service who pay fees to use the remote long-term storage service) may have at least portions of their block data storage volumes stored by the archival storage systems, such as in response to instructions from those customers. In other embodiments, a single organization may provide at least some of both block data storage service capabilities and remote long-term storage service capabilities (e.g., in an integrated manner, such as part of a single service), while in yet other embodiments the block data storage service may be provided in environments that do not include the use of archival data storage systems. Furthermore, in at least some embodiments, the use of the archival storage systems is automatically performed under control of one or more archival manager modules, such as an archival manager module provided by the block data storage service or otherwise provided to operate in conjunction with modules of the block data storage service (e.g., provided by the remote long-term storage service to interact with the block data storage service).

In some embodiments, at least some of the described techniques are performed on behalf of a program execution service that manages execution of multiple programs on behalf of multiple users of the program execution service. In some embodiments, the program execution service may have groups of multiple co-located physical host computing systems in one or more geographic locations, such as in one or more geographically distributed data centers, and may execute users' programs on those physical host computing systems, such as under control of a program execution service (“PES”) system manager, as discussed in greater detail below. In such embodiments, users of the program execution service (e.g., customers of the program execution service who pay fees to use the program execution service) who are also users of the block data storage service may execute programs that access and use non-local block data storage volumes provided via the block data storage service. In other embodiments, a single organization may provide at least some of both program execution service capabilities and block data storage service capabilities (e.g., in an integrated manner, such as part of a single service), while in yet other embodiments the block data storage service may be provided in environments that do not include a program execution service (e.g., internally to a business or other organization to support operations of the organization).

In addition, the host computing systems on which programs execute may have various forms in various embodiments. Multiple such host computing systems may, for example, be co-located in a physical location (e.g., a data center), and may be managed by multiple node manager modules that are each associated with a subset of one or more of the host computing systems. At least some of the host computing systems may each include sufficient computing resources (e.g., volatile memory, CPU cycles or other CPU usage measure, network bandwidth, swap space, etc.) to execute multiple programs simultaneously, and, in at least some embodiments, some or all of the computing systems may each have one or more physically attached local block data storage devices (e.g., hard disks, tape drives, etc.) that can be used to store local copies of programs to be executed and/or data used by such programs. Furthermore, at least some of the host computing systems in some such embodiments may each host multiple virtual machine computing nodes that each may execute one or more programs on behalf of a distinct user, with each such host computing system having an executing hypervisor or other virtual machine monitor that manages the virtual machines for that host computing system. For host computing systems that execute multiple virtual machines, the associated node manager module for the host computing system may in some embodiments execute on at least one of multiple hosted virtual machines (e.g., as part of or in conjunction with the virtual machine monitor for the host computing system), while in other situations a node manager may execute on a physical computing system distinct from one or more other host computing systems being managed.

The server block data storage systems on which volumes are stored may also have various forms in various embodiments. As previously noted, multiple such server block data storage systems may, for example, be co-located in a physical location (e.g., a data center), and may be managed by one or more BDS System Manager modules. In at least some embodiments, some or all of the server block data storage systems may be physical computing systems similar to the host computing systems that execute programs, and in some such embodiments may each execute server storage system software to assist in the provision and maintenance of volumes on those server storage systems. For example, in at least some embodiments, one or more of such server block data storage computing systems may execute at least part of the BDS System Manager, such as if one or more BDS System Manager modules are provided in a distributed peer-to-peer manner by multiple interacting server block data storage computing systems. In other embodiments, at least some of the server block data storage systems may be network storage devices that may lack some I/O components and/or other components of physical computing systems, such as if at least some of the provision and maintenance of volumes on those server storage systems is performed by other remote physical computing systems (e.g., by a BDS System Manager module executing on one or more other computing systems). In addition, in some embodiments, at least some server block data storage systems may each maintain multiple local hard disks and stripe at least some volumes across a portion of each of some or all of the local hard disks. Furthermore, various types of techniques for creating and using volumes may be used, including in some embodiments the use of LVM (“Logical Volume Manager”) technology.

As previously noted, in at least some embodiments, some or all block data storage volumes each have copies stored on two or more distinct server block data storage systems, such as to enhance reliability and availability of the volumes. By doing so, failure of a single server block data storage system may not cause access of executing programs to a volume to be lost, as use of that volume by those executing programs may be switched to another available server block data storage system that has a copy of that volume. In such embodiments, consistency may be maintained between the multiple copies of a volume on the multiple server block data storage systems in various ways. For example, in some embodiments, one of the server block data storage systems is designated as storing the primary copy of the volume, and the other one or more server block data storage systems are designated as storing mirror copies of the volume. In such embodiments, the server block data storage system that has the primary volume copy (referred to as the “primary server block data storage system” for the volume) may receive and handle data access requests for the volume, and in some such embodiments may further take action to maintain the consistency of the other mirror volume copies (e.g., by sending update messages to the other server block data storage systems that provide the mirror volume copies when data in the primary volume copy is modified, such as in a master-slave computing relationship manner). Various types of volume consistency techniques may be used, with additional details included below.
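
As an illustrative sketch of the master-slave consistency scheme just described, the following Python fragment shows a primary copy applying each write locally and then sending update messages to the mirror copies. The `mirror_clients` objects and their `replicate_write` method are hypothetical stand-ins for the inter-server protocol, which the text does not specify.

```python
class PrimaryVolumeCopy:
    """Illustrative primary-copy sketch for master-slave style consistency."""

    def __init__(self, local_store, mirror_clients):
        self.local_store = local_store      # e.g., dict: block offset -> bytes
        self.mirrors = mirror_clients       # hypothetical mirror-server clients

    def handle_write(self, offset, data):
        # Apply the modification to the primary copy first...
        self.local_store[offset] = data
        # ...then send update messages so each mirror copy stays consistent.
        for mirror in self.mirrors:
            mirror.replicate_write(offset, data)

    def handle_read(self, offset):
        # Data access requests are received and handled by the primary copy.
        return self.local_store.get(offset)
```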

In at least some embodiments, the described techniques include providing reliable and available access of an executing program on a computing node to a block data storage volume by managing use of the primary and mirror copies of the volume. For example, the node manager for the executing program may in some embodiments interact solely with the primary volume copy via the primary server block data storage system, such as if the primary volume copy is responsible for maintaining the mirror volume copies or if another replication mechanism is used. In such embodiments, if the primary server block data storage system fails to respond to a request sent by the node manager (e.g., a data access request initiated by the executing program, a ping message or other request initiated by the node manager to periodically check that the primary server block data storage system is available, etc.) within a predefined amount of time, or if the node manager is otherwise alerted that the primary volume copy is unavailable (e.g., by a message from the BDS System Manager), the node manager may automatically switch its interactions to one of the mirror volume copies on a mirror server block data storage system (e.g., with the executing program being unaware of the switch, other than possibly waiting for a slightly longer time to obtain a response to a data access request made by the executing program if it was that data access request that timed out and initiated the switch to the mirror volume copy). The mirror volume copy may be selected in various ways, such as if it is the only one, if an order in which to access multiple mirror volume copies was previously indicated, by interacting with the BDS System Manager to request an indication of which mirror volume copy is promoted to act as the primary volume copy, etc. In other embodiments, at least some volumes may have multiple primary copies, such as if a volume is available for simultaneous read access by multiple executing programs and the resulting data access load is spread across multiple primary copies of the volume. In such embodiments, a node manager may select one of the multiple primary volume copies with which to interact in various ways (e.g., in a random manner, based on an instruction from a BDS System Manager module, etc.).
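
A minimal sketch of this timeout-driven failover follows, assuming hypothetical `send`, `bds_manager`, and `get_promoted_copy` interfaces (none of which are named in the text). It illustrates the sequence only: a request to the primary times out, the node manager learns which mirror copy has been promoted, and the request is retried there.

```python
def access_with_failover(node_manager, request, timeout_seconds=5.0):
    """Illustrative failover sketch; all interfaces here are assumptions."""
    try:
        # Normal case: interact solely with the primary volume copy.
        return node_manager.primary.send(request, timeout=timeout_seconds)
    except TimeoutError:
        # The primary did not respond within the predefined time: ask the
        # BDS System Manager which mirror copy is promoted, then retry there.
        node_manager.primary = node_manager.bds_manager.get_promoted_copy(
            node_manager.volume_id)
        return node_manager.primary.send(request, timeout=timeout_seconds)
```

From the executing program's point of view, the only observable effect is a single slower response for the request that triggered the switch.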

In addition, the BDS System Manager may take various actions in various embodiments to maintain reliable and available access of an executing program on a computing node to a block data storage volume. In particular, if the BDS System Manager becomes aware that a particular server block data storage system (or a particular volume on a particular server block data storage system) becomes unavailable, the BDS System Manager may take various actions for some or all volumes stored by that server block data storage system (or for the particular unavailable volume) to maintain their availability. For example, for each stored primary volume copy on the unavailable server block data storage system, the BDS System Manager may promote one of the existing mirror volume copies to be the new primary volume copy, and optionally notify one or more node managers of the change (e.g., the node managers for any executing programs that are currently using the volume). Furthermore, for each stored volume copy, the BDS System Manager may initiate creation of at least one other new mirror copy of the volume on a different server block data storage system, such as by replicating an existing copy of the volume on another available server block data storage system that has an existing copy (e.g., by replicating the primary volume copy). In addition, in at least some embodiments, other benefits may be achieved in at least some situations by using at least portions of a volume that are stored on remote archival storage systems to assist in replicating a new mirror copy of the volume (e.g., greater data reliability, an ability to minimize an amount of storage used for mirror volume copies and/or ongoing processing power used to maintain full mirror copies of volumes, etc.), as discussed in greater detail below.
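
The recovery logic just described can be sketched as follows. The data model (volume records with a `primary_server` and an ordered `mirror_servers` list) and the manager methods are assumptions chosen to mirror the prose, not interfaces from the text.

```python
def handle_server_unavailable(bds_manager, failed_server):
    """Illustrative sketch of BDS System Manager recovery steps."""
    for volume in failed_server.stored_volume_copies():
        if volume.primary_server is failed_server:
            # Promote an existing mirror copy to be the new primary copy,
            # honoring a previously indicated promotion order if any.
            volume.primary_server = volume.mirror_servers.pop(0)
            bds_manager.notify_node_managers(volume.volume_id,
                                             volume.primary_server)
        else:
            volume.mirror_servers.remove(failed_server)
        # Re-establish the desired number of copies on a healthy server,
        # e.g., by replicating the (possibly newly promoted) primary copy.
        target = bds_manager.pick_available_server(
            exclude=[volume.primary_server, *volume.mirror_servers])
        bds_manager.replicate(volume.primary_server, target, volume.volume_id)
        volume.mirror_servers.append(target)
```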

The BDS System Manager may become aware of the unavailability of a server block data storage system in various ways, such as based on a message from a node manager that cannot contact the server block data storage system, based on a message from the server block data storage system (e.g., to indicate that it has suffered an error condition, has begun a shutdown or failure mode operation, etc.), based on an inability to contact the server block data storage system (e.g., based on periodic or constant monitoring of some or all of the server block data storage systems), etc. Furthermore, unavailability of a server block data storage system may be caused by various occurrences in various embodiments, such as failure of one or more hard disks or other storage mediums on which the server block data storage system stores at least a portion of one or more volumes, failure of one or more other components of the server block data storage system (e.g., the CPU, memory, a fan, etc.), an electrical power failure to the server block data storage system (e.g., a power failure to a single server block data storage system, to a rack of multiple server block data storage systems, to an entire data center, etc.), a network or other communication failure that prevents the server block data storage system from communicating with a node manager and/or the BDS System Manager, etc. In some embodiments, failure of or problems with any component of a server block data storage system may be considered to be an unavailability condition for the entire server block data storage system (e.g., in embodiments in which a server block data storage system maintains multiple local hard disks, failure of or problems with any of the local hard disks may be considered to be an unavailability condition for the entire server block data storage system), while in other embodiments a server block data storage system will not be considered to be unavailable as long as it is able to respond to data access requests.

Furthermore, in addition to moving one or more volumes from an existing server block data storage system when that server block data storage system becomes unavailable, the BDS System Manager may in some embodiments decide to move one or more volumes from an existing server block data storage system to a different server block data storage system and/or decide to create a new copy of one or more volumes at various other times and for various other reasons. Such a movement of or creation of a new copy of a volume may be performed in a manner similar to that discussed in greater detail elsewhere (e.g., by replicating the primary copy of the volume to create a new copy, and by optionally removing the prior copy of the volume in at least some situations, such as when the volume copy is being moved). Situations that may prompt a volume move or new volume copy creation include, for example, the following non-exclusive list: a particular server block data storage system may become over-utilized (e.g., based on usage of CPU, network bandwidth, I/O access, storage capacity, etc.), such as to trigger movement of one or more volumes from that server block data storage system; a particular server block data storage system may lack sufficient resources for a desired modification of an existing volume (e.g., may lack sufficient available storage space if the size of an existing volume is requested to be expanded), such as to trigger movement of one or more volumes from that server block data storage system; a particular server block data storage system may need maintenance or upgrades that will cause it to be unavailable for a period of time, such as to trigger temporary or permanent movement of one or more volumes from that server block data storage system; usage patterns or other characteristics of a particular volume may be recognized as better accommodated on other server block data storage systems, such as another server block data storage system with additional capabilities (e.g., for volumes that have frequent data modifications, to use a primary server block data storage system with higher-than-average disk write capabilities, and/or for volumes that are very large in size, to use a primary server block data storage system with higher-than-average storage capacity); a user who created or is otherwise associated with a volume may request a move (e.g., in response to the user purchasing premium access to a server block data storage system having enhanced capabilities); at least one new copy of a volume may be desired in a different geographical location (e.g., another data center) at which programs execute, such as to trigger movement of and/or copying of the volume from a server block data storage system at a first geographical location when use of a volume by an executing program at another geographical location is requested; etc.
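
For illustration only, several of these triggers can be condensed into a single decision check. The 90% thresholds and all attribute names below are invented for the sketch; the text does not specify any particular values or data model.

```python
def should_move_volume(server_stats, volume, requested_new_size=None):
    """Illustrative trigger check; thresholds and names are assumptions."""
    if server_stats.cpu_utilization > 0.9 or server_stats.io_utilization > 0.9:
        return True        # server block data storage system over-utilized
    if requested_new_size is not None:
        if requested_new_size - volume.size > server_stats.free_storage:
            return True    # insufficient space to expand the volume in place
    if server_stats.maintenance_scheduled:
        return True        # planned unavailability for maintenance or upgrade
    return False
```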

In addition, after a volume has been moved or a new copy created, the BDS System Manager may in some embodiments and situations update one or more node managers as appropriate (e.g., only node managers for executing programs currently using the volume, all node managers, etc.). In other embodiments, various information about volumes may be maintained in other manners, such as by having one or more copies of a volume information database that is network-accessible to node managers and/or the BDS System Manager. A non-exclusive list of types of information about volumes that may be maintained includes the following: an identifier for a volume, such as an identifier that is unique for the server block data storage systems that store copies of the volume or that is globally unique for the block data storage service; restricted access information for a volume, such as passwords or encryption keys, or lists or other indications of authorized users for the volume; information about the primary server block data storage system for the volume, such as a network address and/or other access information; information about one or more mirror server block data storage systems for the volume, such as information about an ordering that indicates which mirror server block data storage system will be promoted to be the primary system if the existing primary server storage system becomes unavailable, a network address and/or other access information, etc.; information about any snapshot volume copies that have been created for the volume, as described in greater detail below; information about whether the volume is to be available to users other than the creator of the volume, and if so under what circumstances (e.g., for read access only, for other users to make their own volumes that are copies of this volume, pricing information for other users to receive various types of access to the volume); etc.
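
The kinds of per-volume information just listed can be pictured as a single record in the volume information database. The field names below are illustrative assumptions that simply restate the list in structured form.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VolumeRecord:
    """Illustrative entry in a volume information database."""
    volume_id: str                                       # unique identifier
    access_keys: dict = field(default_factory=dict)      # restricted-access info
    primary_server: str = ""                             # network address, etc.
    mirror_servers: list = field(default_factory=list)   # in promotion order
    snapshot_ids: list = field(default_factory=list)     # snapshot volume copies
    shared_with_others: bool = False                     # available to non-creators?
    sharing_terms: Optional[str] = None                  # e.g., read-only, pricing
```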

In addition to maintaining reliable and available access of executing programs to block data storage volumes by moving or otherwise replicating volume copies when server block data storage systems become unavailable, the block data storage service may perform other actions in other situations to maintain access of executing programs to block data storage volumes. For example, if a first executing program unexpectedly becomes unavailable, in some embodiments the block data storage service and/or program execution service may take actions to have a different second executing program (e.g., a second copy of the same program that is executing on a different host computing system) attach to some or all block data storage volumes that were in use by the unavailable first program, so that the second program can quickly take over at least some operations of the unavailable first program. The second program may in some situations be a new program whose execution is initiated by the unavailability of the existing first program, while in other situations the second program may already be executing (e.g., if multiple program copies are concurrently executed to share an overall load of work, such as multiple Web server programs that receive different incoming client requests as mediated by a load balancer, with one of the multiple program copies being selected to be the second program; if the second program is a standby copy of the program that is executing to allow a “hot” swap from the existing first program in the event of unavailability, such as without the standby program copy being actively used until the unavailability of the existing first program occurs; etc.). In addition, in some embodiments, a second program to which an existing volume's attachment and ongoing use is switched may be on another host physical computing system in the same geographical location (e.g., the same data center) as the first program, while in other embodiments the second program may be at a different geographical location (e.g., a different data center, such as in conjunction with a copy of the volume that was previously or concurrently moved to that other data center and will be used by that second program). Furthermore, in some embodiments, other related actions may be taken to further facilitate the switch to the second program, such as by redirecting some communications intended for the unavailable first program to the second program.

In addition, in at least some embodiments, other techniques may be used to provide reliable and available access to block data storage volumes, as well as other benefits, such as to allow a copy of an indicated volume to be saved to one or more remote archival storage systems (e.g., at a second geographical location that is remote from a first geographical location at which the server block data storage systems store the active primary and mirror copies of the volume and/or that is remote from the host physical computing systems that execute the programs that use the volume), such as for long-term backups and/or other purposes. For example, in some embodiments, the archival storage systems may be provided by a remote network-accessible storage service. In addition, the copies of a volume that are saved to the archival storage systems may in at least some situations be snapshot copies of the volume at a particular point in time, but which are not automatically updated as ongoing use of the volume causes its stored block data contents to change, and/or which are not available to be attached to and used by executing programs in the same manner as volumes. Thus, as one example, a long-term snapshot copy of a volume may be used, for example, as a backup copy of a volume, and may further in some embodiments serve as the basis of one or more new volumes that are created from the snapshot copy (e.g., such that the new volumes begin with the same block data storage contents as the snapshot copy).

In addition, the snapshot copies of a volume at the archival storage systems may be stored in various manners, such as to represent smaller chunks of a volume (e.g., if the archival storage systems store data as smaller objects rather than a large linear sequential block of data). For example, a volume may be represented as a series of multiple smaller chunks (with a volume having a size of, for example, one gigabyte or one terabyte, and with a chunk having a size that is, for example, a few megabytes), and information about some or all chunks (e.g., each chunk that is modified) may be stored separately on the archival storage systems, such as by treating each chunk as a distinct stored object. Furthermore, in at least some embodiments, a second and later snapshot copy of a particular volume may be created in such a manner as to store only incremental changes from a prior snapshot copy of the volume, such as by including stored copies of new storage chunks that have been created or modified since the prior snapshot copy, but sharing stored copies of some previously existing chunks with the prior snapshot copy if those chunks have not changed. In such embodiments, if a prior snapshot copy is later deleted, the previously existing chunks stored by that prior snapshot copy that are shared by any later snapshot copies may be retained for use by those later snapshot copies, while non-shared previously existing chunks stored by that prior snapshot copy may be deleted.
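
The following sketch illustrates chunk-level incremental snapshots. The `object_store` interface is a hypothetical key-value API, and content-hashing is just one possible way to detect unchanged chunks; the service could equally track modified chunks directly. The 4 MB chunk size is an assumption matching the "a few megabytes" described above.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024      # illustrative: a few megabytes per chunk

def create_incremental_snapshot(volume_bytes, prior_chunk_keys, object_store):
    """Illustrative sketch: store only chunks changed since the prior snapshot."""
    manifest = []
    for offset in range(0, len(volume_bytes), CHUNK_SIZE):
        chunk = volume_bytes[offset:offset + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        if prior_chunk_keys is None or key not in prior_chunk_keys:
            # New or modified chunk: store it as a distinct object.
            object_store.put(key, chunk)
        # Unchanged chunks are shared with the prior snapshot by reference.
        manifest.append(key)
    return manifest    # the snapshot is an ordered list of chunk keys
```

Deleting a prior snapshot then reduces to reference counting: a stored chunk object is removed only when no remaining manifest references it.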

In addition, in at least some embodiments, when creating a snapshot copy of a volume at a point in time, access to the primary volume copy by executing programs may be allowed to continue, including allowing modifications to the data stored in the primary volume copy, but without any such ongoing data modifications being reflected in the snapshot copy, such as if the snapshot copy is based on volume chunks stored on the archival storage systems that are not updated once the snapshot copy creation begins until the snapshot copy creation is completed. For example, in at least some embodiments, copy-on-write techniques are used when creation of a snapshot copy of a volume is initiated and a chunk of the volume is subsequently modified, such as to initially maintain stored copies of both the unmodified chunk and the modified chunk on the primary server block data storage system that stores the primary volume copy (and optionally as well on the mirror server block data storage systems that store one or more of the mirror copies of the volume). When confirmation is received that the archival storage systems have successfully stored the snapshot copy of the volume (including a copy of the unmodified chunk), the unmodified chunk copy on the primary server block data storage system (and optionally on the mirror server block data storage systems) may then be deleted.
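 
A minimal in-memory sketch of this copy-on-write behavior follows; the class and method names are illustrative assumptions. The key point it demonstrates is that writes continue during snapshot creation while the snapshot still sees pre-modification data, and preserved chunks are released once the archive confirms storage.

```python
class SnapshotCopyOnWrite:
    """Illustrative copy-on-write sketch for snapshot creation."""

    def __init__(self, volume_chunks):
        self.chunks = volume_chunks       # live primary copy: index -> bytes
        self.preserved = {}               # index -> pre-snapshot bytes
        self.snapshot_in_progress = False

    def begin_snapshot(self):
        self.snapshot_in_progress = True

    def write_chunk(self, index, data):
        if self.snapshot_in_progress and index not in self.preserved:
            # Keep the unmodified chunk until the archive confirms storage.
            self.preserved[index] = self.chunks[index]
        self.chunks[index] = data         # ongoing use continues unimpeded

    def chunk_for_snapshot(self, index):
        # The snapshot sees pre-modification data for any rewritten chunk.
        return self.preserved.get(index, self.chunks[index])

    def archive_confirmed(self, index):
        # Once archived, the preserved pre-snapshot copy may be deleted.
        self.preserved.pop(index, None)
```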

Moreover, such volume chunks or other volume data stored on the archival storage systems may be used in other manners in at least some embodiments, such as to use the archival storage systems as a backing store for the primary and/or mirror copies of some or all volumes. For example, volume data stored on the archival storage systems may be used to assist in maintaining consistency between multiple copies of a volume on multiple server block data storage systems in at least some situations. As one example, one or more mirror copies of a volume may be created or updated based at least in part on volume chunks stored on the archival storage systems, such as to minimize or eliminate a need to access the primary volume copy to obtain at least some of the volume chunks. For example, if the primary volume copy is updated more quickly or more reliably than modified chunks on the archival storage systems, a new mirror volume copy may be created by using at least some volume chunks stored on the archival storage systems that are known to be accurate (e.g., from a recent snapshot volume copy), and by accessing the primary volume copy only to obtain portions of the volume that correspond to chunks that may have been modified subsequent to creation of the snapshot volume copy. Similarly, if the modified chunks on the archival storage systems reliably reflect a current state of a primary volume copy, a mirror volume copy may be updated using those modified chunks rather than via interactions by or with the primary volume copy.

In addition, in some embodiments, the amount of data that is stored in a mirror volume copy (and the resulting size of the mirror volume copy) may be much less than that of the primary copy of the volume, such as if volume information on the archival storage systems is used in place of at least some data that would otherwise be stored in such a minimal mirror volume copy. As one example, once a snapshot copy of a volume is created on one or more archival storage systems, a minimal mirror copy of a volume need not in such embodiments store any of the volume data that is present in the snapshot volume copy. As modifications are made to the primary volume copy after the snapshot copy creation, some or all of those data modifications may also be made to the minimal mirror volume copy (e.g., all of the data modifications, or only the data modifications that are not reflected in modified volume chunks stored on the archival storage systems). Then, if access to the minimal mirror volume copy is later needed, such as if the minimal mirror volume copy is promoted to be the primary volume copy, the other data that is missing from the minimal mirror volume copy (e.g., the non-modified portions of the volume) may be restored by retrieving it from the archival storage systems (e.g., from the prior snapshot volume copy). In this manner, volume reliability may be enhanced, while also minimizing the amount of storage space used on the server block data storage systems for the mirror volume copies.
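
The promotion-time restore of such a minimal mirror copy can be sketched as follows; the manifest/object-store interfaces are the hypothetical ones from the earlier snapshot sketch, and `modified_chunks` is an assumed field holding post-snapshot writes that were mirrored directly.

```python
def promote_minimal_mirror(minimal_mirror, snapshot_manifest, object_store):
    """Illustrative sketch: rebuild a full copy from a minimal mirror."""
    full_copy = {}
    for index, chunk_key in enumerate(snapshot_manifest):
        if index in minimal_mirror.modified_chunks:
            # Chunks modified after the snapshot were mirrored directly.
            full_copy[index] = minimal_mirror.modified_chunks[index]
        else:
            # Non-modified portions are restored from the archival copy.
            full_copy[index] = object_store.get(chunk_key)
    return full_copy   # now usable as the new primary volume copy
```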

In yet other embodiments, the definitive or master copy of a volume may be maintained on the archival storage systems, and the primary and mirror copies of the volume may reflect a cache or other subset of the volume (e.g., a subset that has been recently accessed and/or that is expected to be accessed soon). In such embodiments, the non-local block data storage volume of the block data storage service may be used to provide a more proximate source of volume data for access by executing programs than the remote archival storage systems. In addition, in at least some such embodiments, a volume may be described to users as being of a particular size that corresponds to the master copy maintained on the archival storage systems, but with the primary and mirror copies being a smaller size. Furthermore, in at least some such embodiments, lazy updating techniques may be used to immediately update a copy of data in a first data store (e.g., a primary volume copy on a server block data storage system) but to update the copy of that same data in a distinct second data store (e.g., the archival storage systems) later, such as in a manner to maintain strict data consistency at the second data store by ensuring that write operations or other data modifications to a portion of a volume are updated at the second data store before performing any subsequent read operation or other access of that portion of the volume from the second data store (e.g., by using write-back cache updating techniques). Such lazy updating techniques may be used, for example, when updating modified chunks of a volume on archival storage systems, or when updating a mirror volume copy from modified chunks of the volume that are stored on archival storage systems. In other embodiments, other techniques may be used when updating modified chunks of a volume on archival storage systems, such as to use write-through cache techniques to immediately update the copy of data in the second data store (e.g., on the archival storage systems) when the copy of the data in the first data store is modified.
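
The two archive-update policies described above can be contrasted in one sketch, assuming hypothetical `primary_store` and `archive` interfaces. Note how the write-back path enforces the stated consistency rule: a dirty chunk is flushed to the second data store before any read of that chunk is served from it.

```python
class ArchiveBackedVolume:
    """Illustrative sketch contrasting write-back and write-through updating."""

    def __init__(self, primary_store, archive, write_through=False):
        self.primary = primary_store       # first data store (server system)
        self.archive = archive             # second data store (archival systems)
        self.write_through = write_through
        self.dirty = set()                 # chunks not yet pushed to the archive

    def write(self, index, data):
        self.primary[index] = data         # first data store: updated immediately
        if self.write_through:
            self.archive.put(index, data)  # second data store: also immediate
        else:
            self.dirty.add(index)          # lazy (write-back): defer the update

    def read_from_archive(self, index):
        # Write-back consistency rule: flush a dirty chunk before any
        # subsequent read of that chunk is served from the archive.
        if index in self.dirty:
            self.archive.put(index, self.primary[index])
            self.dirty.discard(index)
        return self.archive.get(index)
```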

Such snapshot volume copies stored on archival storage systems provide various other benefits as well. For example, if all primary and mirror copies of a volume are stored on multiple server block data storage systems at a single geographical location (e.g., a data center), and the computing and storage systems at that geographical location become unavailable (e.g., electricity is lost to an entire data center), the existence of a recent snapshot copy of the volume at a different remote storage location may ensure that a recent version of the volume is available when the computing and storage systems at the geographical location later become available again (e.g., when electricity is restored), such as if data from one or more server storage systems at the geographical location is lost. Furthermore, in such a situation, one or more new copies of the volume may be created at one or more new geographical locations based on a recent long-term snapshot copy of the volume from the remote archival storage systems, such as to allow one or more executing program copies outside an unavailable geographical location to access and use those new volume copies. Additional details related to archival storage systems and their use are included below.

As previously noted, in at least some embodiments, some or all block data storage volumes each have copies stored on two or more distinct server block data storage systems at a single geographical location, such as within the same data center in which executing programs will access the volume. By locating all of the volume copies and executing programs at the same data center or other geographical location, various desired data access characteristics may be maintained (e.g., based on one or more internal networks at that data center or other geographical location), such as latency and throughput. For example, in at least some embodiments, the described techniques may provide access to non-local block data storage that has access characteristics that are similar to or better than access characteristics of local physical block data storage devices, but with much greater reliability that is similar to or exceeds reliability characteristics of RAID (“Redundant Array of Independent/Inexpensive Disks”) systems and/or dedicated SANs (“Storage Area Networks”), and at much lower cost. In other embodiments, the primary and mirror copies for at least some volumes may instead be stored in other manners, such as at different geographical locations (e.g., different data centers), such as to further maintain availability of a volume even if an entire data center becomes unavailable. In embodiments in which volume copies may be stored at different geographical locations, a user may in some situations request that a particular program be executed proximate to a particular volume (e.g., at the same data center at which the primary volume copy is located), or that a particular volume be located proximate to a particular executing program, such as to provide relatively high network bandwidth and low latency for communications between the executing program and the primary volume copy.

Furthermore, access to some or all of the described techniques may in some embodiments be provided in a fee-based or other paid manner to at least some users. For example, users may pay one-time fees, periodic (e.g., monthly) fees and/or one or more types of usage-based fees to use the block data storage service to store and access volumes, to use the program execution service to execute programs, and/or to use archival storage systems (e.g., provided by a remote long-term storage service) to store long-term backups or other snapshot copies of volumes. Fees may be based on one or more factors and activities, such as indicated in the following non-exclusive list: based on the size of a volume, such as to create the volume (e.g., as a one-time fee), to have ongoing storage and/or use of the volume (e.g., a monthly fee), etc.; based on non-size characteristics of a volume, such as a number of mirror copies, characteristics of server block data storage systems (e.g., data access rates, storage sizes, etc.) on which the primary and/or mirror volume copies are stored, and/or a manner in which the volume is created (e.g., a new volume that is empty, a new volume that is a copy of an existing volume, a new volume that is a copy of a snapshot volume copy, etc.); based on the size of a snapshot volume copy, such as to create the snapshot volume copy (e.g., as a one-time fee) and/or to have ongoing storage of the snapshot volume copy (e.g., a monthly fee); based on the non-size characteristics of one or more snapshot volume copies, such as a number of snapshots of a single volume, whether a snapshot copy is incremental with respect to one or more prior snapshot copies, etc.; based on usage of a volume, such as the amount of data transferred to and/or from a volume (e.g., to reflect an amount of network bandwidth used), a number of data access requests sent to a volume, a number of executing programs that attach to and use a volume (whether sequentially or concurrently), etc.; based on the amount of data transferred to and/or from a snapshot, such as in a manner similar to that for volumes; etc. In addition, the provided access may have various forms in various embodiments, such as a one-time purchase fee, an ongoing rental fee, and/or another ongoing subscription basis. Furthermore, in at least some embodiments and situations, a first group of one or more users may provide data to other users on a fee-based basis, such as to charge the other users for receiving access to current volumes and/or historical snapshot volume copies created by one or more users of the first group (e.g., by allowing them to make new volumes that are copies of volumes and/or of snapshot volume copies; by allowing them to use one or more created volumes; etc.), whether as a one-time purchase fee, an ongoing rental fee, or on another ongoing subscription basis.

In some embodiments, one or more APIs (“application programming interfaces”) may be provided by the block data storage service, program execution service and/or remote long-term storage service, such as to allow other programs to programmatically initiate various types of operations to be performed (e.g., as directed by users of the other programs). Such operations may allow some or all of the previously described types of functionality to be invoked, and include, but are not limited to, the following types of operations: to create, delete, attach, detach, or describe volumes; to create, delete, copy or describe snapshots; to specify access rights or other metadata for volumes and/or snapshots; to manage execution of programs; to provide payment to obtain other types of functionality; to obtain reports and other information about use of capabilities of one or more of the services and/or about fees paid or owed for such use; etc. The operations provided by the API may be invoked by, for example, executing programs on host computing systems of the program execution service and/or by computing systems of customers or other users that are external to the one or more geographical locations used by the block data storage service and/or program execution service.
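
A client-side sketch of such programmatic invocation follows. The endpoint URL, request encoding, operation names, and parameters are all assumptions standing in for whatever API the services actually expose; only the kinds of operations (create, attach, snapshot, describe, detach, delete) come from the list above.

```python
import json
import urllib.request

# Illustrative only: this endpoint and wire format are assumptions.
API_ENDPOINT = "https://blockstorage.example.com/api"

def call_api(operation, **params):
    body = json.dumps({"operation": operation, **params}).encode()
    request = urllib.request.Request(
        API_ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# The kinds of operations enumerated above, invoked programmatically:
volume = call_api("CreateVolume", size_gb=100)
call_api("AttachVolume", volume_id=volume["volume_id"], node_id="node-1")
snapshot = call_api("CreateSnapshot", volume_id=volume["volume_id"])
call_api("DescribeSnapshots", volume_id=volume["volume_id"])
call_api("DetachVolume", volume_id=volume["volume_id"])
call_api("DeleteVolume", volume_id=volume["volume_id"])
```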

For illustrative purposes, some embodiments are described below in which specific types of block data storage are provided in specific ways to specific types of programs executing on specific types of computing systems. These examples are provided for illustrative purposes and are simplified for the sake of brevity, and the inventive techniques can be used in a wide variety of other situations, some of which are discussed below; the techniques are not limited to use with virtual machines, data centers or other specific types of data storage systems, computing systems or computing system arrangements. In addition, while some embodiments are discussed as providing and using reliable non-local block data storage, in other embodiments types of data storage other than block data storage may similarly be provided.

FIG. 1 is a network diagram illustrating an example embodiment in which multiple computing systems execute programs and access reliable non-local block data storage, such as under the control of a block data storage service and/or program execution service. In particular, in this example, a program execution service manages the execution of programs on various host computing systems located within a data center 100, and a block data storage service uses multiple other server block data storage systems at the data center to provide reliable non-local block data storage to those executing programs. Multiple remote archival storage systems external to the data center may also be used to store additional copies of at least some portions of at least some block data storage volumes.

In this example, data center 100 includes a number of racks 105, and each rack includes a number of host computing systems, as well as an optional rack support computing system 122 in this example embodiment. The host computing systems 110 a-c on the illustrated rack 105 each host one or more virtual machines 120 in this example, as well as a distinct Node Manager module 115 associated with the virtual machines on that host computing system to manage those virtual machines. One or more other host computing systems 135 also each host one or more virtual machines 120 in this example. Each virtual machine 120 may act as an independent computing node for executing one or more program copies (not shown) for a user (not shown), such as a customer of the program execution service. In addition, this example data center 100 further includes additional host computing systems 130 a-b that do not include distinct virtual machines, but may nonetheless each act as a computing node for one or more programs (not shown) being executed for a user. In this example, a Node Manager module 125 executing on a computing system (not shown) distinct from the host computing systems 130 a-b and 135 is associated with those host computing systems to manage the computing nodes provided by those host computing systems, such as in a manner similar to the Node Manager modules 115 for the host computing systems 110. The rack support computing system 122 may provide various utility services for other computing systems local to its rack 105 (e.g., long-term program storage, metering and other monitoring of program execution and/or of non-local block data storage access performed by other computing systems local to the rack, etc.), as well as possibly to other computing systems located in the data center. Each computing system 110, 130 and 135 may also have one or more local attached storage devices (not shown), such as to store local copies of programs and/or data created by or otherwise used by the executing programs, as well as various other components.

In this example, an optional computing system 140 is also illustrated that executes a PES System Manager module for the program execution service to assist in managing the execution of programs on the computing nodes provided by the host computing systems located within the data center (or optionally on computing systems located in one or more other data centers 160, or other remote computing systems 180 external to the data center). As discussed in greater detail elsewhere, a PES System Manager module may provide a variety of services in addition to managing execution of programs, including the management of user accounts (e.g., creation, deletion, billing, etc.); the registration, storage, and distribution of programs to be executed; the collection and processing of performance and auditing data related to the execution of programs; the obtaining of payment from customers or other users for the execution of programs; etc. In some embodiments, the PES System Manager module may coordinate with the Node Manager modules 115 and 125 to manage program execution on computing nodes associated with the Node Manager modules, while in other embodiments the Node Manager modules 115 and 125 may not assist in managing such execution of programs.

This example data center 100 also includes a computing system 175 that executes a Block Data Storage (“BDS”) System Manager module for the block data storage service to assist in managing the availability of non-local block data storage to programs executing on computing nodes provided by the host computing systems located within the data center (or optionally on computing systems located in one or more other data centers 160, or other remote computing systems 180 external to the data center). In particular, in this example, the data center 100 includes a pool of multiple server block data storage systems 165, which each have local block storage for use in storing one or more volume copies 155. Access to the volume copies 155 is provided over the internal network(s) 185 to programs executing on computing nodes 120 and 130. As discussed in greater detail elsewhere, a BDS System Manager module may provide a variety of services related to providing non-local block data storage functionality, including the management of user accounts (e.g., creation, deletion, billing, etc.); the creation, use and deletion of block data storage volumes and snapshot copies of those volumes; the collection and processing of performance and auditing data related to the use of block data storage volumes and snapshot copies of those volumes; the obtaining of payment from customers or other users for the use of block data storage volumes and snapshot copies of those volumes; etc. In some embodiments, the BDS System Manager module may coordinate with the Node Manager modules 115 and 125 to manage use of volumes by programs executing on associated computing nodes, while in other embodiments the Node Manager modules 115 and 125 may not be used to manage such volume use. In addition, in other embodiments, one or more BDS System Manager modules may be structured in other manners, such as to have multiple instances of the BDS System Manager executing in a single data center (e.g., to share the management of non-local block data storage by programs executing on the computing nodes provided by the host computing systems located within the data center), and/or such as to have at least some of the functionality of a BDS System Manager module being provided in a distributed manner by software executing on some or all of the server block data storage systems 165 (e.g., in a peer-to-peer manner, without any separate centralized BDS System Manager module on a computing system 175).

In this example, the various host computing systems 110, 130 and 135, server block data storage systems 165, and computing systems 125, 140 and 175 are interconnected via one or more internal networks 185 of the data center, which may include various networking devices (e.g., routers, switches, gateways, etc.) that are not shown. In addition, the internal networks 185 are connected to an external network 170 (e.g., the Internet or other public network) in this example, and the data center 100 may further include one or more optional devices (not shown) at the interconnect between the data center 100 and the external network 170 (e.g., network proxies, load balancers, network address translation devices, etc.). In this example, the data center 100 is connected via the external network 170 to one or more other data centers 160 that each may include some or all of the computing systems and storage systems illustrated with respect to data center 100, as well as to other remote computing systems 180 external to the data center. The other computing systems 180 may be operated by various parties for various purposes, such as by the operator of the data center 100 or third parties (e.g., customers of the program execution service and/or of the block data storage service). In addition, one or more of the other computing systems 180 may be archival storage systems (e.g., as part of a remote network-accessible storage service) with which the block data storage service may interact, such as under control of one or more archival manager modules (not shown) that execute on the one or more other computing systems 180 or instead on one or more computing systems of the data center 100, as described in greater detail elsewhere. Furthermore, while not illustrated here, in at least some embodiments, at least some of the server block data storage systems 165 may further be interconnected with one or more other networks or other connection mediums, such as a high-bandwidth connection over which the server storage systems 165 may share volume data (e.g., for purposes of replicating copies of volumes and/or maintaining consistency between primary and mirror copies of volumes), with such a high-bandwidth connection not being available to the various host computing systems 110, 130 and 135 in at least some such embodiments.

It will be appreciated that the example of FIG. 1 has been simplified for the purposes of explanation, and that the number and organization of host computing systems, server block data storage systems and other devices may be much larger than what is depicted in FIG. 1. For example, as one illustrative embodiment, there may be approximately 4000 computing systems per data center, with at least some of those computing systems being host computing systems that may each host 15 virtual machines, and/or with some of those computing systems being server block data storage systems that may each store several volume copies. If each hosted virtual machine executes one program, then such a data center may execute as many as sixty thousand program copies at one time. Furthermore, hundreds or thousands (or more) of volumes may be stored on the server block data storage systems, depending on the number of server storage systems, the size of the volumes, and the number of mirror copies per volume. It will be appreciated that in other embodiments, other numbers of computing systems, programs and volumes may be used.

FIGS. 2A-2F illustrate examples of providing reliable non-local block data storage functionality to clients. In particular, FIGS. 2A and 2B illustrate examples of server block data storage computing systems that may be used to provide reliable non-local block data storage functionality to clients (e.g., executing programs), such as on behalf of a block data storage service, and FIGS. 2C-2F illustrate examples of using archival storage systems to store at least some portions of some block data storage volumes. In this example, FIG. 2A illustrates several server block data storage systems 165 that each store one or more volume copies 155, such as with each volume having a primary copy and at least one mirror copy. In other embodiments, other arrangements may be used, as discussed in greater detail elsewhere, such as by having multiple primary volume copies (e.g., with all of the primary volume copies being available for simultaneous read access by one or more programs) and/or by having multiple mirror volume copies. The example server block data storage systems 165 and volume copies 155 may, for example, correspond to a subset of the server block data storage systems 165 and volume copies 155 of FIG. 1.

In this example, the server storage system 165a stores at least three volume copies, including the primary copy 155A-a for volume A, a mirror copy 155B-a for volume B, and a mirror copy 155C-a for volume C. One or more other volume copies that are not illustrated in this example may further be stored by the server storage system 165a, as well as by the other server storage systems 165. Another example server block data storage system 165b stores the primary copy 155B-b for volume B in this example, as well as a mirror copy 155D-b for volume D. In addition, example server block data storage system 165n includes a mirror copy 155A-n of volume A and a primary copy 155D-n of volume D. Thus, if an executing program (not shown) is attached to and using volume A, the node manager for that executing program will be interacting with server block data storage system 165a to access the primary copy 155A-a for volume A, such as via server storage system software (not shown) that executes on the server block data storage system 165a. Similarly, for one or more executing programs (not shown) attached to and using volumes B and D, the node manager(s) for the executing program(s) will interact with server block data storage systems 165b and 165n, respectively, to access the primary copies 155B-b for volume B and 155D-n for volume D, respectively. In addition, other server block data storage systems may further be present (e.g., server block data storage systems 165c-165m and/or 165o and beyond), and may store the primary volume copy for volume C and/or other primary and mirror volume copies, but are not shown in this example. Thus, in this example, each server block data storage system may store more than one volume copy, and may store a combination of primary and mirror volume copies, although in other embodiments volumes may be stored in other manners.

FIG. 2B illustrates server block data storage systems 165 similar to those of FIG. 2A, but at a later point in time after server storage system 165b of FIG. 2A has failed or otherwise become unavailable. In response to the unavailability of server storage system 165b, and of its stored primary copy of volume B and mirror copy of volume D, the stored volume copies of the server storage systems 165 of FIG. 2B have been modified to maintain availability of volumes B and D. In particular, due to the unavailability of the primary copy 155B-b of volume B, the prior mirror copy 155B-a of volume B on server storage system 165a has been promoted to be the new primary copy for volume B. Thus, if one or more programs were previously attached to or otherwise interacting with the prior primary copy 155B-b of volume B when it became unavailable, those programs may have been automatically transitioned (e.g., by node managers associated with those programs) to continue ongoing interactions with server block data storage system 165a to access the new primary copy 155B-a for volume B. In addition, a new mirror copy 155B-c for volume B has been created on server storage system 165c.
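
As a rough Python sketch of the bookkeeping that such a failure response involves, the following models the FIG. 2A layout and applies the FIG. 2B changes. The class and function names, and the simple selection of replacement servers, are illustrative assumptions rather than details of the described service:

    from dataclasses import dataclass, field

    @dataclass
    class VolumePlacement:
        """Tracks which server holds a volume's primary copy and which hold mirrors."""
        primary: str
        mirrors: list = field(default_factory=list)

    # Layout corresponding to FIG. 2A (server ids reuse the figure's reference numbers).
    placements = {
        "A": VolumePlacement(primary="165a", mirrors=["165n"]),
        "B": VolumePlacement(primary="165b", mirrors=["165a"]),
        "D": VolumePlacement(primary="165n", mirrors=["165b", "165c"]),
    }

    def handle_server_failure(failed: str, spare_servers: list) -> None:
        """Promote a mirror for any volume whose primary was lost, and re-create
        lost mirrors on other servers (spares are assumed to be sufficient)."""
        for volume, p in placements.items():
            target = len(p.mirrors)            # keep the same number of mirrors as before
            if p.primary == failed:
                p.primary = p.mirrors.pop(0)   # promote a surviving mirror to primary
            if failed in p.mirrors:
                p.mirrors.remove(failed)
            while len(p.mirrors) < target:
                candidate = next(s for s in spare_servers
                                 if s != p.primary and s not in p.mirrors)
                p.mirrors.append(candidate)

    handle_server_failure("165b", spare_servers=["165c", "165o"])
    # Volume B's primary is now 165a with a new mirror on 165c, and volume D has
    # gained a replacement mirror on 165o, matching FIG. 2B.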

While the mirror copy 155A-n of volume A on server storage system 165n of FIG. 2A is not illustrated in FIG. 2B for the sake of brevity, it continues to be available on server storage system 165n along with the primary copy 155D-n of volume D, and thus any programs that were previously attached to or otherwise interacting with the primary copy 155D-n of volume D when server storage system 165b became unavailable will continue to interact with that same primary volume D copy 155D-n on server storage system 165n without modification. However, due to the unavailability of the mirror copy 155D-b of volume D on the unavailable server storage system 165b, at least one additional mirror copy of volume D has been created in FIG. 2B, such as the volume D mirror copy 155D-o on server storage system 165o. In addition, FIG. 2B illustrates that at least some volumes may have multiple mirror copies, such as volume D, which also includes a previously existing (but not shown in FIG. 2A) volume D mirror copy 155D-c on server storage system 165c.

FIGS. 2C-2F illustrate examples of using archival storage systems to store at least some portions of some block data storage volumes. In this example, FIG. 2C illustrates multiple server block data storage systems 165 that each store one or more volume copies 155, such as to correspond to the example server block data storage systems 165 illustrated in FIG. 2A at a time before server block data storage system 165b becomes unavailable. FIG. 2C further illustrates multiple archival storage systems 180, which may, for example, correspond to a subset of the computing systems 180 of FIG. 1. In particular, in this example, FIG. 2C illustrates server block data storage systems 165a and 165b of FIG. 2A, although in this example only the primary and mirror copies of volume B are illustrated for those server block data storage systems. As discussed with respect to FIG. 2A, the server storage system 165b stores the primary copy 155B-b of volume B, and server storage system 165a stores the mirror copy 155B-a of volume B.

In the example of FIG. 2C, a user associated with volume B has requested that a new initial snapshot copy of volume B be stored on remote archival storage systems, such as for long-term backup. Accordingly, volume B has been separated into multiple chunk portions that will each be stored separately by the archival storage systems, such as to correspond to a typical or maximum storage size for the archival storage systems, or instead in another manner as determined by the block data storage service. In this example, the primary copy 155B-b of volume B has been separated into N chunks 155B-b1 through 155B-bN, and the mirror copy 155B-a of volume B similarly stores the same data using chunks 155B-a1 through 155B-aN. Each of the N chunks of volume B is stored as a separate data object on one of two example archival storage systems 180a and 180b, and thus those multiple corresponding stored data objects in aggregate form the initial snapshot volume copy for volume B. In particular, chunk 1 155B-b1 of the primary volume B copy is stored as data object 180B1 on archival storage system 180a, chunk 2 155B-b2 is stored as data object 180B2 on archival storage system 180b, chunk 3 155B-b3 is stored as data object 180B3 on archival storage system 180a, and chunk N 155B-bN is stored as data object 180BN on archival storage system 180a. In this example, the separation of volume B into multiple chunks is performed by the block data storage service, such that individual chunks of volume B may be individually transferred to the archival storage systems, although in other embodiments the entire volume B may instead be sent to the archival storage systems, which may then separate the volume into multiple chunks or otherwise process the volume data if so desired.
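
As an informal illustration of that separation step, the following Python sketch splits a volume image into fixed-size chunks and stores each chunk as a separate data object. The chunk size, the store_object helper, the object-naming scheme, and the alternation between archival systems are all hypothetical details, not ones specified by the described service:

    CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size; the service may choose differently

    def store_object(archival_system: str, object_id: str, data: bytes) -> None:
        """Placeholder for a network call that writes one data object to one archival system."""
        print(f"stored {object_id} ({len(data)} bytes) on {archival_system}")

    def snapshot_volume(volume_id: str, volume_bytes: bytes, archival_systems: list) -> list:
        """Split a volume into chunks and store each as a separate data object.
        Returns the ordered list of object ids that together form the snapshot copy."""
        manifest = []
        for offset in range(0, len(volume_bytes), CHUNK_SIZE):
            chunk_index = offset // CHUNK_SIZE + 1
            object_id = f"180{volume_id}{chunk_index}"        # e.g., 180B1, 180B2, ...
            target = archival_systems[(chunk_index - 1) % len(archival_systems)]
            store_object(target, object_id, volume_bytes[offset:offset + CHUNK_SIZE])
            manifest.append(object_id)
        return manifest

    snapshot_manifest = snapshot_volume("B", b"\x00" * (4 * CHUNK_SIZE), ["180a", "180b"])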

In addition, in this example, the archival storage system 180b is an archival storage computing system that executes an Archival Manager module 190 to manage operations of the archival storage systems, such as to manage the storage and retrieval of data objects, to track which stored data objects correspond to which volumes, to separate transferred volume data into multiple data objects, to meter and otherwise track use of the archival storage systems, etc. The Archival Manager module 190 may, for example, maintain a variety of information about the various data objects that correspond to a particular volume, such as for each snapshot copy of the volume, as discussed in greater detail with respect to FIG. 2F, while in other embodiments such snapshot volume copy information may instead be maintained in other manners (e.g., by the server block data storage systems or other modules of the block data storage service). In other embodiments, only a single archival storage system may be used, or instead the data objects corresponding to chunks of volume B may be stored across many more archival storage systems (not shown). In addition, in other embodiments, each archival storage system may execute at least part of an archival manager module, such as for each archival storage system to have a distinct archival manager module, or for all of the archival storage systems to provide the functionality of the archival manager module in a distributed peer-to-peer manner. In other embodiments, one or more archival manager modules may instead execute on one or more computing systems that are local to the other block data storage service modules (e.g., on the same computing system as, or on a computing system proximate to, one that executes a BDS System Manager module), or the operations of the archival storage systems may instead be managed directly by one or more other modules of the block data storage service without using an archival manager module (e.g., by a BDS System Manager module).

Furthermore, in at least some embodiments, the archival storage systems may perform various operations to enhance the reliability of stored data objects, such as to replicate some or all data objects on multiple archival storage systems. Thus, for example, the other data objects 182b of archival storage system 180b may include mirror copies of one or more of the data objects 180B1, 180B3, and 180BN of archival storage system 180a, and the other data objects 182a of archival storage system 180a may similarly store a mirror copy of data object 180B2 of archival storage system 180b. Furthermore, as discussed in greater detail elsewhere, in some embodiments at least some chunks of volume B may already be stored on the archival storage systems before the request to create the initial snapshot copy of volume B is received, such as if the data objects stored on the archival storage systems to represent the volume B chunks are used as a backing store or other remote long-term backup for volume B. If so, the snapshot copy on the archival storage systems may instead be created without transferring any additional volume data at that time, such as if the data objects on the archival storage systems represent a current state of the volume B chunks, while in other embodiments additional steps may be taken to ensure that the already stored data objects are up to date with respect to the volume B chunks.

FIG. 2D continues the example of FIG. 2C, and reflects modifications to volume B that are performed after the initial snapshot copy is stored with respect to FIG. 2C. In particular, in this example, after the initial snapshot volume copy is created, volume B is modified, such as by one or more programs (not shown) that are attached to the volume. In this example, data is modified in at least two portions of volume B that correspond to chunk 3 155B-b3 and chunk N 155B-bN of the primary volume B copy, with the modified chunk data being illustrated as data 3a and Na, respectively. In this example, after the primary volume B copy 155B-b is modified, the server storage system 165b initiates corresponding updates of the mirror volume B copy 155B-a on server storage system 165a, such that chunk 3 155B-a3 of the mirror copy is modified to include the modified 3a data, and chunk N 155B-aN of the mirror copy is modified to include the modified Na data. Thus, the mirror volume B copy is maintained in the same state as that of the primary volume B copy in this example.
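
In code, that primary-to-mirror update path might look roughly like the following Python sketch; the VolumeCopy and PrimaryCopy classes and the chunk-indexed write interface are assumptions made for illustration, not the described system's actual interface:

    class VolumeCopy:
        """A volume copy held as a mapping of chunk index to data, indexed from 1 as in FIG. 2D."""
        def __init__(self, num_chunks: int):
            self.chunks = {i: b"" for i in range(1, num_chunks + 1)}

        def write_chunk(self, index: int, data: bytes) -> None:
            self.chunks[index] = data

    class PrimaryCopy(VolumeCopy):
        def __init__(self, num_chunks: int, mirrors: list):
            super().__init__(num_chunks)
            self.mirrors = mirrors

        def write_chunk(self, index: int, data: bytes) -> None:
            super().write_chunk(index, data)          # apply the write locally first
            for mirror in self.mirrors:               # then propagate to each mirror copy
                mirror.write_chunk(index, data)

    mirror_b = VolumeCopy(num_chunks=4)
    primary_b = PrimaryCopy(num_chunks=4, mirrors=[mirror_b])
    primary_b.write_chunk(3, b"data 3a")              # chunk 3 updated on primary and mirror
    primary_b.write_chunk(4, b"data Na")
    assert mirror_b.chunks == primary_b.chunks        # mirror kept in the same state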

In addition, in some embodiments, data on the archival storage systems may be further modified to reflect changes to volume B, even though those new volume B data modifications are not currently part of any snapshot volume copy for volume B. In particular, since the prior version of the chunk 3 and chunk N data is part of the initial snapshot volume copy stored on the archival storage systems, the corresponding data objects 180B3 and 180BN are not modified to reflect the changes to the volume B data that occur subsequent to the creation of the initial snapshot volume copy. Instead, if copies are optionally made of the modified volume B data, they are stored in this example as additional data objects, such as an optional data object 180B3a to correspond to the modified 3a data of chunk 3 155B-b3, and an optional data object 180BNa to correspond to the modified Na data of chunk N 155B-bN. In this manner, the data for the initial snapshot volume copy is maintained even as changes are made to the primary and mirror copies of volume B. If the optional data objects 180B3a and 180BNa are created, that creation may be initiated in various ways, such as by the server block data storage system 165b in a manner similar to the updates that are initiated for the mirror volume B copy 155B-a.
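
The following Python sketch shows one plausible copy-on-write arrangement for this behavior, under the assumption (not stated in the text) that new object ids are derived by appending a version suffix; the archive dict stands in for the archival storage systems:

    archive = {
        # Initial snapshot of volume B: object id -> chunk data (as in FIG. 2C).
        "180B1": b"data 1", "180B2": b"data 2", "180B3": b"data 3", "180BN": b"data N",
    }

    def record_modified_chunk(base_object_id: str, new_data: bytes, suffix: str) -> str:
        """Store modified chunk data as a new data object, leaving the snapshot's
        original object untouched (copy-on-write)."""
        new_object_id = base_object_id + suffix       # e.g., 180B3 -> 180B3a
        assert new_object_id not in archive
        archive[new_object_id] = new_data
        return new_object_id

    record_modified_chunk("180B3", b"data 3a", "a")
    record_modified_chunk("180BN", b"data Na", "a")
    assert archive["180B3"] == b"data 3"              # the initial snapshot data is preserved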

FIG. 2E illustrates an alternative to the embodiment described previously with respect to FIG. 2D. In particular, in the example of FIG. 2E, volume B is again modified after the initial snapshot copy of the volume is stored on the archival storage systems, in a manner similar to that discussed with respect to FIG. 2D, and accordingly the primary copy 155B-b of volume B on server storage system 165b is updated so that chunk 3 155B-b3 and chunk N 155B-bN include the modified data 3a and Na, respectively. However, in this embodiment, the mirror copy 155B-a of volume B on server storage system 165a is not maintained as a full copy of volume B. Instead, the snapshot volume copy of volume B on the archival storage systems is used in conjunction with the mirror copy 155B-a to maintain a copy of volume B. Thus, in this example, as modifications are made to the primary copy 155B-b of volume B after the creation of the initial snapshot volume copy, those modifications are also made for the mirror copy 155B-a on server storage system 165a, such that the mirror copy stores the modified 3a data for chunk 3 155B-a3 and the modified Na data for chunk N 155B-aN. However, the mirror copy of volume B does not initially store copies of the other chunks of volume B that have not been modified since the initial snapshot volume copy was created, since the snapshot volume copy of volume B on the archival storage systems includes copies of that data. Accordingly, if server storage system 165b later becomes unavailable, such as previously discussed with respect to FIG. 2B, the mirror copy 155B-a of volume B on server storage system 165a may be promoted to be the new primary copy of volume B. In order to accomplish this promotion in this example embodiment, the remaining portions of the mirror copy 155B-a of volume B are restored using the initial snapshot volume copy of volume B on the archival storage systems, such as by using the stored data object 180B1 to restore chunk 155B-a1, using the stored data object 180B2 to restore chunk 155B-a2, etc. Furthermore, in this example, the data objects 180B3a and 180BNa may similarly be optionally stored on the archival storage systems to represent the modified 3a and Na data. If so, in some embodiments, the modified 3a and Na data will also not be initially stored on the server block data storage system 165a for the mirror copy 155B-a, and instead the mirror copy chunks 155B-a3 and 155B-aN may similarly be restored from the archival storage system data objects 180B3a and 180BNa, in a manner similar to that previously described for the other mirror copy chunks.
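
A compact Python sketch of promoting such a minimal mirror, filling in its unmodified chunks from the snapshot's data objects; the sparse-mirror representation, the get_object helper, and the use of 4 chunks standing in for N are all assumptions made for illustration:

    snapshot_objects = {1: "180B1", 2: "180B2", 3: "180B3", 4: "180BN"}  # chunk -> object id
    archive = {"180B1": b"data 1", "180B2": b"data 2",
               "180B3": b"data 3", "180BN": b"data N"}

    def get_object(object_id: str) -> bytes:
        """Placeholder for fetching one data object from the archival storage systems."""
        return archive[object_id]

    # A minimal mirror stores only chunks modified since the initial snapshot.
    sparse_mirror = {3: b"data 3a", 4: b"data Na"}

    def promote_minimal_mirror(mirror: dict, num_chunks: int) -> dict:
        """Restore missing chunks from the snapshot so the mirror can serve as primary."""
        for index in range(1, num_chunks + 1):
            if index not in mirror:
                mirror[index] = get_object(snapshot_objects[index])
        return mirror

    new_primary = promote_minimal_mirror(sparse_mirror, num_chunks=4)
    assert new_primary[1] == b"data 1" and new_primary[3] == b"data 3a"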

While the snapshot volume copy of volume B is used in the prior example to restore the mirror copy of volume B when the mirror copy is promoted to be the new primary volume copy, the snapshot volume copy on the archival storage systems may be used in other manners in other embodiments. For example, a new copy of volume B that matches the initial snapshot volume copy may be created using the snapshot volume copy on the archival storage systems, in a manner similar to that previously described for restoring the mirror volume copy, such as to create a new mirror copy of volume B as of the time of the snapshot volume copy, to create an entirely new volume that is based on the snapshot volume copy of volume B, to assist in moving volume B from one server block storage system to another, etc. In addition, when the server block data storage systems of the block data storage service are available in multiple distinct data centers or other geographical locations, the remote archival storage systems may be available to all of those server block data storage systems, and thus may be used to create a new volume copy based on a snapshot volume copy in any of those geographical locations.

FIG. 2F continues the examples of FIGS. 2C and 2D, at a later point in time after additional modifications have been made to volume B. In particular, after the modifications to chunk 3 and chunk N are made as described with respect to FIG. 2D, a second snapshot volume copy of volume B is created on the archival storage systems. Subsequently, additional modifications are made to data of volume B that is stored in chunks 2 and 3. Accordingly, the primary copy of volume B 155B-b as illustrated in FIG. 2F includes original data 1 in chunk 1 155B-b1, data 2a in chunk 2 155B-b2 that was modified subsequent to creation of the second snapshot volume copy, data 3b in chunk 3 155B-b3 that was also modified subsequent to creation of the second snapshot volume copy, and data Na in chunk N 155B-bN that was modified after creation of the initial snapshot volume copy but prior to creation of the second snapshot volume copy. Accordingly, after a third snapshot volume copy of volume B is indicated to be created, additional data objects are created in the archival storage systems to correspond to the two chunks modified since the creation of the second snapshot volume copy, with data object 180B2a corresponding to chunk 155B-b2 and including modified data 2a, and data object 180B3b corresponding to chunk 155B-b3 and including modified data 3b.

In addition, in this example, the server block data storage system 165a is not shown, but a copy of the information 250 maintained by the Archival Manager module 190 (e.g., stored on the archival storage system 180b or elsewhere) is shown to provide information about snapshot volume copies stored on the archival storage systems. In particular, in this example, the information 250 includes multiple rows 250a-250d, which each correspond to a distinct snapshot volume copy. Each of the rows of information in this example includes a unique identifier for the volume copy, an indication of the volume to which the snapshot volume copy corresponds, and an indication of an ordered list of the data objects stored on the archival storage systems that comprise the snapshot volume copy. Thus, for example, row 250a corresponds to the initial snapshot volume copy of volume B discussed with respect to FIG. 2C, and indicates that the initial snapshot volume copy includes stored data objects 180B1, 180B2, 180B3, and so on through 180BN. Row 250b corresponds to an example snapshot volume copy for a different volume A that includes various stored data objects that are not shown in this example. Row 250c corresponds to the second snapshot volume copy of volume B, and row 250d corresponds to the third snapshot volume copy of volume B. In this example, the second and third volume copies for volume B are incremental copies rather than full copies, such that chunks of volume B that have not changed since a prior snapshot volume copy will continue to be represented using the same stored data objects. Thus, for example, the second snapshot copy of volume B in row 250c indicates that the second snapshot volume copy shares data objects 180B1 and 180B2 with that of the initial snapshot volume copy of volume B (and possibly some or all of the data objects for chunks 4 through N−1, not shown). Similarly, the third snapshot copy of volume B shown in row 250d also continues to use the same data object 180B1 as the initial and second snapshot volume copies.
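
Read as a data structure, the information 250 is essentially a table mapping each snapshot copy to the ordered data objects that comprise it. A hypothetical rendering in Python, with invented snapshot ids and with 4 chunks standing in for N; the object lists follow the modification history described above:

    # Each row: snapshot copy id -> (volume, ordered list of data objects forming the copy).
    snapshot_info = {
        "snap-B-1": ("B", ["180B1", "180B2", "180B3", "180BN"]),      # row 250a (FIG. 2C)
        "snap-A-1": ("A", ["180A1", "180A2"]),                        # row 250b (objects not shown)
        "snap-B-2": ("B", ["180B1", "180B2", "180B3a", "180BNa"]),    # row 250c, incremental
        "snap-B-3": ("B", ["180B1", "180B2a", "180B3b", "180BNa"]),   # row 250d, incremental
    }

    def objects_for(snapshot_id: str) -> list:
        """Return the ordered data objects that together form one snapshot volume copy."""
        _, objects = snapshot_info[snapshot_id]
        return objects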

By sharing common data objects between multiple snapshot volume copies, the amount of storage used on the archival storage systems is reduced, since an unchanging volume chunk such as chunk 1 does not have separate copies on the archival storage systems for each snapshot volume copy. In other embodiments, however, some or all snapshot volume copies may not be incremental, with each instead including a separate copy of each volume chunk regardless of whether the data in the chunk has changed. In addition, when incremental snapshot volume copies are used that may share one or more overlapping data objects with one or more other snapshot volume copies, the overlapping data objects must be managed appropriately when additional types of operations are taken with respect to the snapshot volume copies. For example, if a request is subsequently received to delete the initial snapshot volume copy for volume B that is indicated in row 250a, and to accordingly free up storage space on the archival storage systems that is no longer needed, only some of the data objects indicated for that initial snapshot volume copy may be deleted on the archival storage systems. For example, chunk 3 and chunk N were modified after the initial snapshot volume copy was created, and thus the corresponding stored data objects 180B3 and 180BN for the initial snapshot volume copy are used only by that initial snapshot volume copy. Thus, those two data objects may be permanently deleted from the archival storage system 180a if the initial snapshot volume copy of volume B is deleted. However, the data objects 180B1 and 180B2 will be maintained even if that initial snapshot volume copy of volume B is deleted, since they continue to be a part of at least the second snapshot volume copy of volume B.
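
One straightforward way to implement that deletion rule is to delete only data objects not referenced by any surviving snapshot copy, as in this Python sketch (reusing the hypothetical snapshot_info table sketched above):

    def delete_snapshot(snapshot_id: str, snapshot_info: dict, archive: dict) -> None:
        """Delete a snapshot copy, removing only data objects no other copy still references."""
        _, doomed_objects = snapshot_info.pop(snapshot_id)
        still_referenced = {obj for _, objects in snapshot_info.values() for obj in objects}
        for obj in doomed_objects:
            if obj not in still_referenced:
                archive.pop(obj, None)        # permanently delete the unshared data object

    archive = {obj: b"..." for _, objects in snapshot_info.values() for obj in objects}
    delete_snapshot("snap-B-1", snapshot_info, archive)
    assert "180B3" not in archive and "180BN" not in archive   # used only by the initial copy
    assert "180B1" in archive and "180B2" in archive           # still shared with snap-B-2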

While not illustrated in this example, the information 250 may include a variety of other types of information about the snapshot volume copies, including information about which archival storage system stores each of the data objects, information about who is allowed to access the snapshot volume copy information and under what circumstances, etc. As one example, in some embodiments, some users may create snapshot volume copies and make access to those snapshot volume copies available to at least some other users in at least some circumstances, such as on a fee-based basis that allows the other users to create copies of one or more particular snapshot volume copies. If so, such access-related information may be stored in the information 250 or elsewhere, and the Archival Manager module 190 may use such information to determine whether to satisfy requests made for information corresponding to particular snapshot volume copies. Alternatively, in other embodiments, access to the snapshot volume copies may instead be managed by other modules of the block data storage service (e.g., a BDS System Manager module), such as to prevent requests from being sent to the archival storage systems unless those requests are authorized.

It will be appreciated that the examples of FIGS. 2A-2F have been simplified for the purposes of explanation, and that the number and organization of server block data storage systems, archival storage systems, and other devices may be much larger than what is depicted. Similarly, in other embodiments, primary volume copies, mirror volume copies, and/or snapshot volume copies may be stored and managed in other manners.

FIG. 3 is a block diagram illustrating example computing systems suitable for managing the provision and use of reliable non-local block data storage functionality to clients. In this example, a server computing system 300 executes an embodiment of a BDS System Manager module 340 to manage provision of non-local block data storage functionality to programs executing on host computing systems 370 and/or on at least some other computing systems 390, such as access to block data storage volumes (not shown) provided by the server block data storage systems 360. Each of the host computing systems 370 in this example also executes an embodiment of a Node Manager module 380 to manage access of programs 375 executing on the host computing system to at least some of the non-local block data storage volumes, such as in a coordinated manner with the BDS System Manager module 340 over a network 385 (e.g., an internal network of a data center, not shown, that includes the computing systems 300, 360, 370, and optionally at least some of the other computing systems 390). In other embodiments, some or all of the Node Manager modules 380 may instead manage one or more other computing systems (e.g., the other computing systems 390).

In addition, multiple server block data storage systems 360 are illustrated that each store at least some of the non-local block data storage volumes (not shown) used by the executing programs 375, with access to those volumes also provided over the network 385 in this example. One or more of the server block data storage systems 360 may also each store a server software component (not shown) that manages operation of one or more of the server block data storage systems 360, as well as various information (not shown) about the data that is stored by the server block data storage systems 360. Thus, in at least some embodiments, the server computing system 300 of FIG. 3 may correspond to the computing system 175 of FIG. 1, one or more of the Node Manager modules 115 and 125 of FIG. 1 may correspond to the Node Manager modules 380 of FIG. 3, and/or one or more of the server block data storage computing systems 360 of FIG. 3 may correspond to the server block data storage systems 165 of FIG. 1. In addition, in this example embodiment, multiple archival storage systems 350 are illustrated, which may store snapshot copies and/or other copies of at least portions of at least some block data storage volumes stored on the server block data storage systems 360. The archival storage systems 350 may also interact with some or all of the computing systems 300, 360, and 370, and in some embodiments may be remote archival storage systems (e.g., of a remote storage service, not shown) that interact with the computing systems 300, 360, and 370 over one or more other external networks (not shown).

The other computing systems 390 may further include other proximate or remote computing systems of various types in at least some embodiments, including computing systems via which customers or other users of the block data storage service interact with the computing systems 300 and/or 370. Furthermore, one or more of the other computing systems 390 may further execute a PES System Manager module to coordinate execution of programs on the host computing systems 370 and/or other computing systems, or the computing system 300 or one of the other illustrated computing systems may instead execute such a PES System Manager module, although a PES System Manager module is not illustrated in this example.

In this example embodiment, computing system 300 includes a CPU (“central processing unit”) 305, local storage 320, memory 330, and various I/O (“input/output”) components 310, with the illustrated I/O components in this example including a display 311, a network connection 312, a computer-readable media drive 313, and other I/O devices 315 (e.g., a keyboard, mouse, speakers, microphone, etc.). In the illustrated embodiment, the BDS System Manager module 340 is executing in memory 330, and one or more other programs (not shown) may also optionally be executing in memory 330.

Each computing system 370 similarly includes a CPU 371, local storage 377, memory 374, and various I/O components 372 (e.g., I/O components similar to I/O components 310 of server computing system 300). In the illustrated embodiment, a Node Manager module 380 is executing in memory 374 in order to manage one or more other programs 375 executing in memory 374 on the computing system, such as on behalf of customers of the program execution service and/or block data storage service. In some embodiments, some or all of the computing systems 370 may host multiple virtual machines, and if so, each of the executing programs 375 may be an entire virtual machine image (e.g., with an operating system and one or more application programs) executing on a distinct hosted virtual machine computing node. The Node Manager module 380 may similarly be executing on another hosted virtual machine, such as a privileged virtual machine monitor that manages the other hosted virtual machines. In other embodiments, the executing program copies 375 and the Node Manager module 380 may execute as distinct processes on a single operating system (not shown) executed on computing system 370.

Each archival storage system 350 in this example is a computing system that includes a CPU 351, local storage 357, memory 354, and various I/O components 352 (e.g., I/O components similar to I/O components 310 of server computing system 300). In the illustrated embodiment, an Archival Manager module 355 is executing in memory 354 in order to manage operation of one or more of the archival storage systems 350, such as on behalf of customers of the block data storage service and/or of a distinct storage service that provides the archival storage systems. In other embodiments, the Archival Manager module 355 may instead be executing on another computing system, such as one of the other computing systems 390 or on computing system 300 in conjunction with the BDS System Manager module 340. In addition, while not illustrated here, in some embodiments various information about the data that is stored by the archival storage systems 350 may be maintained on storage 357 or elsewhere, such as previously described with respect to FIG. 2F. Furthermore, while also not illustrated here, each of the server block data storage systems 360 and/or other computing systems 390 may similarly include some or all of the types of components illustrated with respect to the archival storage systems 350, such as a CPU, local storage, memory, and various I/O components.

The BDS System Manager module 340 and Node Manager modules 380 may take various actions to manage the provision and use of reliable non-local block data storage functionality to clients (e.g., executing programs), as described in greater detail elsewhere. In this example, the BDS System Manager module 340 may maintain a database 325 on storage 320 that includes information about volumes stored on the server block data storage systems 360 and/or on the archival storage systems 350 (e.g., for use in managing the volumes), and may further store various other information (not shown) about users or other aspects of the block data storage service. In other embodiments, information about volumes may be stored in other manners, such as in a distributed manner by the Node Manager modules 380 on their computing systems and/or by other computing systems. In addition, in this example, each Node Manager module 380 on a host computing system 370 may store information 378 on local storage 377 about the current volumes attached to the host computing system and used by the executing programs 375 on the host computing system, such as to coordinate interactions with the server block data storage systems 360 that provide the primary copies of the volumes, and to determine how to switch to a mirror copy of a volume if the primary volume copy becomes unavailable. While not illustrated here, each host computing system may further include a distinct logical local block data storage device interface for each volume attached to the host computing system and used by a program executing on the computing system, which may further appear to the executing programs as being indistinguishable from one or more other local physically attached storage devices that provide local storage 377.

It will be appreciated that computing systems 300, 350, 360, 370 and 390 are merely illustrative and are not intended to limit the scope of the present invention. For example, computing systems 300, 350, 360, 370 and/or 390 may be connected to other devices that are not illustrated, including through network 385 and/or one or more other networks, such as the Internet or the World Wide Web (“Web”). More generally, a computing node or other computing system or data storage system may comprise any combination of hardware or software that can interact and perform the described types of functionality, including without limitation desktop or other computers, database servers, network storage devices and other network devices, PDAs, cellphones, wireless phones, pagers, electronic organizers, Internet appliances, television-based systems (e.g., using set-top boxes and/or personal/digital video recorders), and various other consumer products that include appropriate communication capabilities. In addition, the functionality provided by the illustrated modules may in some embodiments be combined in fewer modules or distributed in additional modules. Similarly, in some embodiments, the functionality of some of the illustrated modules may not be provided and/or other additional functionality may be available.

It will also be appreciated that, while various items are illustrated as being stored in memory or on storage while being used, these items or portions of them may be transferred between memory and other storage devices for purposes of memory management and data integrity. Alternatively, in other embodiments some or all of the software modules and/or systems may execute in memory on another device and communicate with the illustrated computing systems via inter-computer communication. Furthermore, in some embodiments, some or all of the systems and/or modules may be implemented or provided in other manners, such as at least partially in firmware and/or hardware, including, but not limited to, one or more application-specific integrated circuits (ASICs), standard integrated circuits, controllers (e.g., by executing appropriate instructions, and including microcontrollers and/or embedded controllers), field-programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), etc. Some or all of the modules, systems and data structures may also be stored (e.g., as software instructions or structured data) on a computer-readable medium, such as a hard disk, a memory, a network, or a portable media article to be read by an appropriate drive or via an appropriate connection. The systems, modules and data structures may also be transmitted as generated data signals (e.g., as part of a carrier wave or other analog or digital propagated signal) on a variety of computer-readable transmission mediums, including wireless-based and wired/cable-based mediums, and may take a variety of forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). Such computer program products may also take other forms in other embodiments. Accordingly, the present invention may be practiced with other computer system configurations.

FIG. 4 is a flow diagram of an example embodiment of a Block Data Storage System Manager routine 400. The routine may be provided by, for example, execution of the BDS System Manager module on computing system 175 of FIG. 1 and/or of the BDS System Manager module 340 of FIG. 3, such as to provide a block data storage service for use by executing programs. In the illustrated embodiment, the routine may interact with multiple server block data storage systems at a single data center or other geographical location (e.g., if each such data center or other geographical location has a distinct embodiment of the routine executing at that geographical location), although in other embodiments a single routine 400 may support multiple distinct data centers or other geographical locations.

The illustrated embodiment of the routine begins at block 405, where a request or other information is received. The routine continues to block 410 to determine whether the received request was to create a new block data storage volume, such as from a user of the block data storage service and/or from an executing program that would like to access the new volume, and if so continues to block 415 to perform the volume creation. In the illustrated embodiment, the routine in block 415 selects one or more server block data storage systems on which copies of the volume will be stored (e.g., based at least in part on the location and/or capabilities of the selected server storage systems), initializes the volume copies on those selected server storage systems, and updates stored information about volumes to reflect the new volume. For example, in some embodiments, the creation of a new volume may include initializing a specified size of linear storage on each of the selected servers in a specified manner, such as to be blank, to include a copy of another indicated volume (e.g., another volume at the same data center or other geographical location, or instead a volume stored at a remote location), to include a copy of an indicated snapshot volume copy (e.g., a snapshot volume copy stored by one or more archival storage systems, such as by interacting with the archival storage systems to obtain the snapshot volume copy), etc. In other embodiments, a logical block of linear storage of a specified size for a volume may be created on one or more server block data storage systems, such as by using multiple non-contiguous storage areas that are presented as a single logical block and/or by striping a logical block of linear storage across multiple local physical hard disks. To create a copy of a volume that already exists at another data center or other geographical location, the routine may, for example, coordinate with another instance of the routine 400 that supports block data storage service operations at that location. Furthermore, in some embodiments, at least some volumes will each have multiple copies that include at least one primary volume copy and one or more mirror copies on multiple distinct server storage systems, and if so multiple server storage systems may be selected and initialized.
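
An informal Python sketch of the steps in block 415 (server selection, copy initialization, metadata update) follows; the random placement policy, the helper names, and the fixed number of copies are invented for illustration and represent only one of the many arrangements the text allows:

    import random

    def initialize_copy(server: str, volume_id: str, size_gb: int) -> None:
        """Placeholder for allocating blank linear storage for one volume copy on one server."""
        print(f"initialized {size_gb} GB for {volume_id} on {server}")

    def create_volume(volume_id: str, size_gb: int, servers: list,
                      volumes_db: dict, num_copies: int = 2) -> None:
        """Select servers, initialize one volume copy on each, and record the new volume.
        The first selected server holds the primary copy; the rest hold mirrors."""
        selected = random.sample(servers, num_copies)   # stand-in for a real placement policy
        for server in selected:
            initialize_copy(server, volume_id, size_gb)
        volumes_db[volume_id] = {"size_gb": size_gb,
                                 "primary": selected[0],
                                 "mirrors": selected[1:]}

    volumes_db = {}
    create_volume("vol-B", 100, ["165a", "165b", "165n"], volumes_db)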

If it is instead determined in block 410 that the received request is not to create a volume, the routine continues instead to block 420 to determine whether the received request is to attach an existing volume to an executing program copy, such as a request received from the executing program copy or from another computing system operated on behalf of a user associated with the executing program copy and/or the indicated volume. If so, the routine continues to block 425 to identify at least one of the server block data storage systems that stores a copy of the volume, and to associate at least one of the identified server storage systems with the executing program (e.g., to associate the primary server storage system for the volume with the computing node on which the program executes, such as by causing a logical local block storage device that represents the primary volume copy to be mounted on the computing node). The volume to be attached may be identified in various ways, such as by a unique identifier for the volume and/or an identifier for a user who created or is otherwise associated with the volume. After attaching the volume to the executing program copy, the routine may further update stored information about the volume to indicate the attachment of the executing program, such as if only a single program is allowed to be attached to the volume at a time, or if only a single program is allowed to have write access or other modification access to the volume at a time. In addition, in the illustrated embodiment, information about at least one of the identified server storage systems may be provided to a node manager associated with the executing program, such as to facilitate the actual attachment of the volume to the executing program, although in other embodiments the node manager may have other access to such information.

If it is instead determined in block 420 that the received request is not to attach a volume to an executing program, the routine continues instead to block 430 to determine whether the received request is to create a snapshot copy of an indicated volume, such as a request received from an executing program that is attached to the volume or instead from another computing system (e.g., a computing system operated by a user associated with the volume and/or a user who has purchased access to create a snapshot copy of another user's volume). In some embodiments, a snapshot volume copy may be created of a volume regardless of whether the volume is attached to or in use by any executing programs, and/or regardless of whether the volume is stored at the same data center or other geographical location at which the routine 400 executes. If so, the routine continues to block 435 to initiate creation of a snapshot volume copy of the indicated volume, such as by interacting with one or more archival manager modules that coordinate operations of one or more archival storage systems (e.g., archival storage systems at a remote storage location, such as in conjunction with a remote long-term storage service that is accessible over one or more networks). In some embodiments, the snapshot volume copy creation may be performed by a third-party remote storage service in response to an instruction from the routine 400, such as if the remote storage service already stores at least some chunks of the volume. Furthermore, various other parameters may further be specified in at least some embodiments, such as whether the snapshot volume copy is to be incremental with respect to one or more other snapshot volume copies, etc.

If it is instead determined in block 430 that the received request is not to create a snapshot volume copy, the routine continues instead to block 440 to determine whether the information received in block 405 is an indication of failure or other unavailability of one or more server block data storage systems (or of one or more volumes, in other embodiments). For example, as described below with respect to block 485, the routine may in some embodiments monitor the status of some or all of the server block data storage systems and determine unavailability on that basis, such as by periodically or constantly sending ping messages or other messages to server block data storage systems to determine whether a response is received, or by otherwise obtaining information about the status of the server storage systems. If it is determined in block 440 that the received information indicates the possible failure of one or more server storage systems, the routine continues to block 445 to take actions to maintain the availability of the one or more volumes stored on the indicated one or more server storage systems. In particular, the routine in block 445 determines whether any such volumes stored on the indicated one or more server storage systems are primary volume copies, and for each such primary volume copy, promotes one of the mirror copies for that volume on another server storage system to be the new primary copy for that volume. In block 450, the routine then causes at least one new copy of each volume to be replicated on one or more other server storage systems, such as by using an existing copy of the volume that is available on a server storage system other than one of those indicated to be unavailable. In other embodiments, the promotion of mirror copies to primary copies and/or the creation of new mirror copies may instead be performed in other manners, such as in a distributed manner by the server block data storage systems (e.g., using an election protocol between the mirror copies of a volume). In addition, in some embodiments the mirror volume copies may be minimal copies that include only portions of a primary copy of a volume (e.g., only portions that have been modified since a snapshot copy of the volume was previously created), and the promotion of a mirror copy to a primary copy may further include gathering information for the new primary copy to make it complete (e.g., from the most recent snapshot copy).

In block 455, the routine then optionally initiates attachments of one or more executing programs to any new primary volume copies that were promoted from mirror copies, such as for executing programs that were previously attached to primary volume copies on the one or more unavailable server storage systems, although in other embodiments such re-attachment to new primary volume copies will instead be performed in other manners (e.g., by a node manager associated with the executing program for which the re-attachment will occur). In block 458, the routine then updates information about the volumes on the unavailable server storage systems, such as to indicate the new volume copies created in block 450 and the new primary volume copies promoted in block 445. In other embodiments, new primary volume copies may be created in other manners, such as by creating a new volume copy as a primary volume copy rather than promoting an existing mirror volume copy, although doing so may take longer than promoting an existing mirror volume copy. In addition, if no volume copies are available from which to replicate new volume copies in block 450, such as if multiple server storage systems that store the primary and mirror copies for a volume all fail substantially simultaneously, the routine may in some embodiments attempt to obtain information for the volume to use in such replication in other manners, such as from one or more recent snapshot volume copies for the volume that are available on archival storage systems, from a copy of the volume at another data center or other geographical location, etc.

If it is instead determined in block 440 that the received information is not an indication of failure or other unavailability of one or more server block data storage systems, the routine continues instead to block 460 to determine whether the information received in block 405 indicates to move one or more volumes to one or more new server block data storage systems. Such volume movement may be performed for a variety of reasons, as discussed in greater detail elsewhere, including to other server block data storage systems at the same geographical location (e.g., to move existing volumes to storage systems that are better equipped to support the volumes) and/or to one or more server block data storage systems at one or more other data centers or other geographical locations. In addition, movement of a volume may be initiated in various ways, such as due to a request from a user of the block data storage service that is associated with the volume, due to a request from a human operator of the block data storage service, based on an automated detection of a better server storage system for a volume than the current server storage system being used (e.g., due to over-utilization of the current server storage system and/or under-utilization of the new server storage system), etc. If it is determined in block 460 that the received information indicates to move one or more such volume copies, the routine continues to block 465 and creates a copy of each indicated volume on one or more new server block data storage systems, such as in a manner similar to that previously discussed with respect to block 415 (e.g., by using an existing volume copy on a server block data storage system, by using a snapshot or other copy of the volume on one or more archival storage systems, etc.), and further updates stored information for the volume in block 465. In addition, in some embodiments the routine may take additional actions to support the movement, such as to delete the prior volume copy from a server block data storage system after the new volume copy is created. Furthermore, in situations in which one or more executing programs were attached to the prior volume copy being moved, the routine may initiate the detachment of the prior volume copy for an executing program and/or may initiate a re-attachment of such an executing program to the new volume copy being created, such as by sending associated instructions to a node manager for the executing program, although in other embodiments the node manager may instead perform such actions.

If it is instead determined in block 460 that the received information is not an instruction to move one or more volumes, the routine continues instead to block 485 to perform one or more other indicated operations as appropriate. Such other operations may have various forms in various embodiments, such as one or more of the following non-exclusive list: performing monitoring of some or all server block data storage systems (e.g., by sending ping messages or other status messages to the server block data storage systems and waiting for a response); initiating creation of a replacement primary volume copy and/or mirror volume copy in response to determining that a primary or mirror copy of a volume is unavailable, such as based on monitoring that is performed, on a message received from a primary server block data storage system that stores a primary copy of a volume but is unable to update one or more mirror copies of that volume, on a message received from a node manager module, etc.; detaching, deleting, and/or describing one or more volumes; deleting, describing and/or copying one or more snapshot volume copies; tracking use of volumes and/or snapshot volume copies by users, such as to meter such usage for payment purposes; etc. After blocks 415, 425, 435, 458, 465, or 485, the routine continues to block 495 to determine whether to continue, such as until an explicit termination instruction is received. If so, the routine returns to block 405, and if not the routine continues to block 499 and ends.

In addition, for at least some types of requests, the routine may in some embodiments further verify that the requester is authorized to make the request, such as based on access rights specified for the requester and/or for an associated target of the request (e.g., an indicated volume). In some such embodiments, the verification of authorization may further include obtaining payment from the requester for the requested functionality (or verifying that any such payment has already been provided), such as by not performing the request if the payment is not provided. For example, types of requests that may have associated payment in at least some embodiments and situations include requests to create a volume, attach a volume, create a snapshot copy, move an indicated volume (e.g., to a premium server storage system), and other types of indicated operations. Furthermore, some or all types of actions taken on behalf of users may be monitored and metered, such as for later use in determining corresponding usage-based fees for at least some of those actions.

FIG. 5 is a flow diagram of an example embodiment of a Node Manager routine 500. The routine may be provided by, for example, execution of a Node Manager module 115 and/or 125 of FIG. 1, and/or execution of a Node Manager module 380 of FIG. 3, such as to manage the use by one or more executing programs of non-local block data storage. In the illustrated embodiment, the block data storage service provides functionality through a combination of one or more BDS System Manager modules, multiple Node Manager modules, and optionally one or more Archival Manager modules, although in other embodiments other configurations may be used (e.g., a single BDS System Manager module without any Node Manager modules and/or Archival Manager modules, multiple Node Manager modules executing together in a coordinated manner without a BDS System Manager module, etc.).

The illustrated embodiment of the routine begins in block 505, where a request is received related to program execution on an associated computing node. The routine continues to block 510 to determine whether the request is related to executing one or more indicated programs on an indicated associated computing node, such as a request from a program execution service and/or a user associated with those programs. If so, the routine continues to block 515 to obtain a copy of the indicated program(s) and to initiate execution of the program(s) on an associated computing node. In some embodiments, the one or more indicated programs may be obtained in block 515 based on the indicated programs being sent to the routine 500 as part of the request received in block 505, while in other embodiments the indicated programs may be retrieved from local or non-local storage (e.g., from a remote storage service). In other embodiments, the routine 500 may instead not perform operations related to executing programs, such as if another routine that supports the program execution service instead performs those operations on behalf of associated computing nodes.

If it is instead determined in block 510 that the received request is not to execute one or more indicated programs, the routine continues instead to block 520 to determine whether a request has been received to attach an indicated volume to an indicated executing program, such as from the executing program, from the routine 400 of FIG. 4, and/or from a user associated with the indicated volume and/or the indicated executing program. If so, the routine continues to block 525 to obtain an indication of a primary copy of the volume, and to associate that primary volume copy with a representative logical local block data storage device for the computing node. In some embodiments, the representative local logical block data storage device may be indicated to the executing program and/or computing node by the routine 500, while in other embodiments the executing program may instead initiate the creation of the local logical block data storage device. For example, in some embodiments the routine 500 may use GNBD (“Global Network Block Device”) technology to make the logical local block data storage device available to a virtual machine computing node, by importing a block device into a particular virtual machine and mounting that logical local block data storage device. In some embodiments, the routine may take further actions at block 525, such as to obtain and store indications of one or more mirror volume copies for the volume, such as to allow the routine to dynamically attach to a mirror volume copy if the primary volume copy later becomes unavailable.
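
A minimal Python sketch of the bookkeeping in block 525, mapping an attached volume to a representative local device name and remembering mirror locations for later failover; the device-naming scheme and the shape of the attachment record are assumptions made for illustration (the text mentions GNBD only as one example mechanism):

    attached_volumes = {}   # volume id -> attachment record kept by the node manager

    def attach_volume(volume_id: str, primary_server: str, mirror_servers: list) -> str:
        """Associate a volume's primary copy with a local logical block device."""
        device = f"/dev/bds/{volume_id}"       # hypothetical local device name
        attached_volumes[volume_id] = {
            "device": device,
            "primary": primary_server,
            "mirrors": list(mirror_servers),   # kept so a mirror can be promoted later
        }
        return device

    device = attach_volume("vol-B", primary_server="165b", mirror_servers=["165a"])
    # The executing program now reads and writes `device` as if it were local storage.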

If it is instead determined in block 520 that the received request of block 505 is not to attach an indicated volume, the routine continues instead to block 530 to determine whether the received request is a data access request by an executing program for an attached volume, such as a read request or a write request. If so, the routine continues to block 535, where the routine identifies the associated primary volume copy that corresponds to the data access request (e.g., based on the representative local logical block data storage device used by the executing program for the data access request), and initiates the requested data access to the primary volume copy. As discussed in greater detail elsewhere, in some embodiments a lazy write scheme may be used, such as by immediately modifying the actual primary and/or mirror volume copies to reflect a write data access request (e.g., to always update the mirror volume copy, to update a mirror volume copy only if the mirror volume copy is being promoted to be the primary volume copy, etc.), but not immediately modifying a corresponding chunk stored on one or more archival storage systems to reflect the write data access request (e.g., so as to eventually update the copy stored on the archival storage systems when sufficient modifications have been made and/or when read access to corresponding information is requested, etc.). In the illustrated embodiment, the maintaining of mirror volume copies is performed by a routine other than the routine 500 (e.g., by the primary server block data storage system that stores the primary volume copy), although in other embodiments the routine 500 may in block 535 further assist in maintaining one or more of the mirror volume copies by sending similar or identical data access requests to those mirror volume copies. Furthermore, in some embodiments a volume may not be stored on the archival storage systems until explicitly requested by a corresponding user (e.g., as part of a request to create a snapshot copy of the volume), while in other embodiments a copy of at least some portions of at least some volumes may be maintained on the archival storage systems (e.g., if the archival storage systems' copy is used as a backing store for the primary and/or mirror volume copies).
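
The lazy write scheme described here might be sketched in Python as follows; the dirty-chunk threshold and flush interface are invented details, and this is only one of several triggers (amount of modification, read access, etc.) that the text allows:

    FLUSH_THRESHOLD = 8    # assumed number of dirty chunks that triggers an archival update

    class LazyVolume:
        """Applies writes to the primary (and mirror) copies immediately, but defers
        updating the archival storage systems until enough chunks are dirty."""
        def __init__(self, copies: list):
            self.copies = copies               # primary copy first, then mirror copies
            self.dirty_chunks = {}             # chunk index -> latest data not yet archived

        def write(self, chunk_index: int, data: bytes) -> None:
            for copy in self.copies:           # immediate update of primary and mirrors
                copy[chunk_index] = data
            self.dirty_chunks[chunk_index] = data
            if len(self.dirty_chunks) >= FLUSH_THRESHOLD:
                self.flush_to_archive()

        def flush_to_archive(self) -> None:
            """Placeholder for sending the accumulated dirty chunks to archival storage."""
            print(f"archiving chunks {sorted(self.dirty_chunks)}")
            self.dirty_chunks.clear()

    volume = LazyVolume(copies=[{}, {}])       # one primary dict and one mirror dict
    volume.write(3, b"data 3a")                # copies updated now; archive updated later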

After block 535, the routine continues to block 540 to determine whether a response is received from the primary server block data storage system for the request sent in block 535 within a predefined time limit, such as to indicate success of the operation. If not, the routine determines that the primary server block data storage system is unavailable, and continues to block 545 to initiate a change to attach one of the mirror volume copies as the new primary volume copy, and to associate the server block data storage system for that mirror volume copy as the new primary server block data storage system for the volume. The routine then sends the data access request to the new primary volume copy in a manner similar to that indicated above with respect to block 535, and may further in some embodiments monitor whether an appropriate response is received and proceed to block 545 again if not (e.g., to promote another mirror volume copy and repeat the process). In some embodiments, the initiating of the change to a mirror volume copy as a new primary volume copy may be performed in coordination with routine 400, such as by initiating contact with routine 400 to determine which mirror volume copy should become the new primary volume copy, by receiving instructions from routine 400 when a mirror volume copy is promoted to be a primary volume copy by the routine 500 (e.g., as prompted by an indication sent by the routine 500 in block 545 that the primary volume copy is unavailable), etc.
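
The block 540/545 failover logic amounts to a retry loop that promotes mirror copies until a response arrives. A hedged Python sketch follows; send_with_failover, the attachment record, and the timeout convention (a send that returns None on timeout) are assumptions for illustration:

    # Illustrative sketch of the block 540/545 failover loop: if the
    # primary does not respond within the time limit, promote a mirror
    # copy and retry; names and the timeout mechanism are assumptions.
    def send_with_failover(attachment, request, send, timeout=5.0):
        while True:
            response = send(attachment["primary"], request, timeout=timeout)
            if response is not None:  # success within the time limit
                return response
            if not attachment["mirrors"]:
                raise RuntimeError("no mirror volume copies left to promote")
            # Block 545: promote a mirror copy to be the new primary copy
            # and record its server as the new primary server storage system.
            attachment["primary"] = attachment["mirrors"].pop(0)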

If it is instead determined in block 530 that the received request is not a data access request for an attached volume, the routine continues instead to block 585 to perform one or more other indicated operations as appropriate. The other operations may have various forms in various embodiments, such as receiving instructions from routine 400 with new volume information for one or more volumes (e.g., a newly promoted primary volume copy for a volume to which one or more executing programs being managed are attached), detaching a volume from an executing program on a computing node associated with the routine 500, etc. In addition, in at least some embodiments, the routine 500 may further perform one or more other actions of a virtual machine monitor, such as if the routine 500 operates as part of or otherwise in conjunction with a virtual machine monitor that manages one or more associated virtual machine computing nodes.

After blocks 515, 525, 545, or 585, or if it is instead determined in block 540 that a response is received within a predefined time limit, the routine continues to block 595 to determine whether to continue, such as until an explicit termination instruction is received. If so, the routine returns to block 505, and if not continues to block 599 and ends.

In addition, for at least some types of requests, the routine may in some embodiments further verify that the requester is authorized to make the request, such as based on access rights specified for the requester and/or an associated target of the request (e.g., an indicated volume). In some such embodiments, the verification of authorization may further include obtaining payment from the requester for the requested functionality (or verifying that any such payment has already been provided), such as to not perform the request if the payment is not provided. For example, types of requests that may have associated payment in at least some embodiments and situations include requests to execute indicated programs, attach a volume, perform some or all types of data access requests, and other types of indicated operations. Furthermore, some or all types of actions taken on behalf of users may be monitored and metered, such as for later use in determining corresponding usage-based fees for at least some of those actions.
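
One simple way to realize such a check is shown below, purely as an illustrative sketch; the access-rights table and usage log are assumptions, and a real embodiment might also verify payment at this point:

    # Hypothetical per-request authorization and metering check; the
    # access-rights model shown here is an assumption for illustration.
    def authorize_and_meter(requester, request, access_rights, usage_log):
        allowed = request["op"] in access_rights.get(requester, set())
        if allowed:
            # Record the action so usage-based fees can be computed later;
            # fee-based embodiments could also verify payment here.
            usage_log.append((requester, request["op"], request.get("volume")))
        return allowed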

FIG. 6 is a flow diagram of an example embodiment of a Server Block Data Storage System routine 600. The routine may be provided by, for example, execution of a software component on a server block data storage system, such as to manage the storage of block data on one or more block data storage volumes on that server storage system (e.g., for server block data storage systems 165 of FIG. 1 and/or of FIG. 2). In other embodiments, some or all of the functionality of the routine may be provided in other manners, such as by software executing on one or more other computing systems to manage one or more server block data storage systems.

The illustrated embodiment of the routine begins at block 605, where a request is received. The routine continues to block 610 to determine whether the received request is related to creating a new volume, such as by associating a block of available storage space for the server storage system (e.g., storage space on one or more local hard disks) with a new indicated volume. The request may, for example, be from routine 400 and/or from a user associated with the new volume being created. If so, the routine continues to block 615 to store information about the new volume, and in block 620 initializes storage space for the new volume (e.g., a logical linear block of storage space of an indicated size). As discussed in greater detail elsewhere, in some embodiments new volumes may be created based on another existing volume or snapshot volume copy, and if so the routine may in block 620 initialize the storage space for the new volume by copying appropriate data to the storage space, while in other embodiments the routine may initialize new volume storage space in other manners (e.g., by initializing the storage space to a default value, such as all zeros).
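
Blocks 615 and 620 reduce to recording volume metadata and initializing the volume's storage, either from source data or to zeros. A minimal sketch, with all names hypothetical:

    # Sketch of blocks 615/620: record the new volume, then initialize its
    # storage either from a source snapshot/volume or to a default value.
    def create_volume(volumes, volume_id, size_bytes, source_data=None):
        volumes[volume_id] = {"size": size_bytes}
        if source_data is not None:
            # New volume based on an existing volume or snapshot copy.
            storage = bytearray(
                source_data[:size_bytes].ljust(size_bytes, b"\x00"))
        else:
            # Otherwise initialize the space to a default value (all zeros).
            storage = bytearray(size_bytes)
        volumes[volume_id]["data"] = storage
        return storage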

If it is instead determined in block 610 that the received request is not to create a new volume, the routine continues instead to block 625 to determine whether a data access request has been received for an existing volume stored on the server storage system, such as from a node manager associated with an executing program that initiated the data access request. If so, the routine continues to block 630 to perform the data access request on the indicated volume. The routine then continues to block 635 to, in the illustrated embodiment, optionally initiate corresponding updates for one or more mirror copies of the volume, such as if the indicated volume on the current server storage system is the primary volume copy for the volume. In other embodiments, consistency between a primary volume copy and mirror volume copies may be maintained in other manners. As discussed in greater detail elsewhere, in some embodiments, at least some modifications to the stored data contents of at least some volumes may also be performed to one or more archival storage systems (e.g., at a remote storage service), such as to maintain a backing copy or other copy of those volumes, and if so the routine may further initiate corresponding updates for one or more copies of the volume on the archival storage systems. Furthermore, if the routine determines in block 635 or elsewhere that a mirror copy of the volume is not available (e.g., based on a failure to respond within a predefined amount of time to a data access request sent in block 635, or to a ping message or other status message initiated by the routine 600 to periodically check that the mirror volume copy and its mirror server block data storage system are available; based on a message from the mirror server block data storage system that it has suffered an error condition or has begun a shutdown or failure mode operation; etc.), the routine may initiate actions to create a new mirror copy of the volume, such as by sending a corresponding message to the routine 400 of FIG. 4 or instead by directly initiating the mirror volume copy creation.
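
The block 630/635 behavior of applying a write locally, fanning it out to mirror copies, and reacting to an unresponsive mirror might look roughly like the following sketch; the function and callback names are assumptions, and the None-on-timeout convention again stands in for a real transport:

    # Illustrative sketch of blocks 630/635; names are assumptions.
    def apply_write(local_volume, mirrors, request, send, on_mirror_failure):
        offset, data = request["offset"], request["data"]
        # Block 630: perform the write on the locally stored volume copy.
        local_volume[offset:offset + len(data)] = data
        # Block 635: initiate corresponding updates on each mirror copy.
        for mirror in list(mirrors):
            if send(mirror, request) is None:  # no response within time limit
                mirrors.remove(mirror)
                # Treat the mirror as unavailable and initiate creation of a
                # replacement, e.g., by messaging the BDS System Manager.
                on_mirror_failure(mirror)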

If it is instead determined in block 625 that the received request is not a data access request for a volume, the routine continues to block 685 to perform one or more other indicated operations as appropriate. Such other operations may have various forms in various embodiments, such as one or more of the following non-exclusive list of operations: to delete a volume (e.g., so as to make the associated storage space available for other use); to copy a volume to an indicated destination (e.g., to another new volume on another server block data storage system, to one or more archival storage systems for use as a snapshot volume copy, etc.); to provide information about use of volumes (e.g., for metering of volume use, such as for fee-based volume use by customers); to perform ongoing maintenance or diagnostics for the server block data storage system (e.g., to defragment local hard disks); etc. After blocks 620, 635, or 685, the routine continues to block 695 to determine whether to continue, such as until an explicit termination instruction is received. If so, the routine returns to block 605, and if not continues to block 699 and ends.

In addition, for at least some types of requests, the routine may in some embodiments further verify that the requester is authorized to make the request, such as based on access rights specified for the requester and/or an associated target of the request (e.g., an indicated volume), while in other embodiments the routine may assume that requests have been previously authorized by a routine from which it receives requests (e.g., a Node Manager routine and/or a BDS System Manager routine). Furthermore, some or all types of actions taken on behalf of users may be monitored and metered, such as for later use in determining corresponding usage-based fees for at least some of those actions.

FIGS. 7A and 7B are a flow diagram of an example embodiment of a PES System Manager routine 700. The routine may be provided by, for example, execution of a PES System Manager module 140 of FIG. 1. In other embodiments, some or all of the functionality of the routine 700 may instead be provided in other manners, such as by routine 400 as part of the block data storage service.

In the illustrated embodiment, the routine begins at block 705, where a status message or other request related to the execution of a program is received. The routine continues to block 710 to determine the type of the received message or request. If it is determined in block 710 that the type is a request to execute a program, such as from a user or executing program, the routine continues to block 720 to select one or more host computing systems on which to execute the indicated program, such as from a group of candidate host computing systems available for program execution. In some embodiments, the one or more host computing systems may be selected in accordance with user instructions or other indicated criteria of interest. The routine then continues to block 725 to initiate execution of the program by each of the selected host computing systems, such as by interacting with a Node Manager associated with the selected host computing system. In block 730, the routine then optionally performs one or more housekeeping tasks (e.g., monitoring program execution by users, such as for metering and/or other billing purposes).
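
Block 720's host selection could be as simple as filtering candidates by the indicated criteria and ranking the remainder. The sketch below is one plausible reading, with the candidate fields and scoring rule assumed for illustration:

    # Hypothetical sketch of block 720: pick host computing systems from
    # the candidate group according to user-indicated criteria.
    def select_hosts(candidates, count, criteria=None):
        criteria = criteria or {}
        eligible = [h for h in candidates
                    if h["free_cpu"] >= criteria.get("min_cpu", 0)]
        # Prefer the least-loaded hosts among those satisfying the criteria.
        eligible.sort(key=lambda h: h["load"])
        return eligible[:count]

    hosts = [{"id": "h1", "free_cpu": 4, "load": 0.7},
             {"id": "h2", "free_cpu": 8, "load": 0.2}]
    print(select_hosts(hosts, 1, {"min_cpu": 2}))  # -> [{'id': 'h2', ...}]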

If it is instead determined in block 710 that the received request is to register a new program as being available for later execution, the routine continues instead to block 740 to store an indication of the program and associated administrative information for its use (e.g., access control information related to users who are authorized to use the program and/or authorized types of uses), and may further store at least one centralized copy of the program in some situations. The routine then continues to block 745 to optionally initiate distribution of copies of the indicated program to one or more host computing systems for later use, such as to allow rapid startup of the program by those host computing systems by retrieving the stored copy from local storage of those host computing systems. In other embodiments, one or more copies of the indicated program may be stored in other manners, such as on one or more remote archival storage systems.

If it is instead determined in block 710 that a status message is received in block 705 concerning one or more host computing systems, the routine continues instead to block 750 to update information concerning those host computing systems, such as to track usage of executing programs and/or other status information about host computing systems (e.g., use of non-local block data storage volumes). In some embodiments, status messages will be sent periodically by node manager modules, while in other embodiments, status messages may be sent at other times (e.g., whenever a relevant change occurs). In yet other embodiments, the routine 700 may instead request information from node manager modules and/or host computing systems as desired. Status messages may include a variety of types of information, such as the number and identity of programs currently executing on a particular computing system, the number and identity of copies of programs currently stored in the local program repository on a particular computing system, attachments and/or other use of non-local block data storage volumes, performance-related and resource-related information (e.g., utilization of CPU, network, disk, memory, etc.) for a computing system, configuration information for a computing system, and reports of error or failure conditions related to hardware or software on a particular computing system.
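
One possible concrete shape for such a status message is sketched below; the field names are assumptions chosen to mirror the kinds of information listed above, not a format from the described embodiment:

    # Hypothetical status-message structure for block 750 updates.
    from dataclasses import dataclass, field

    @dataclass
    class HostStatus:
        host_id: str
        executing_programs: list = field(default_factory=list)
        stored_program_copies: list = field(default_factory=list)
        attached_volumes: list = field(default_factory=list)   # non-local volumes
        cpu_utilization: float = 0.0
        network_utilization: float = 0.0
        error_conditions: list = field(default_factory=list)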

If the routine instead determines in block 705 that another type of request or message is received, the routine continues instead to block 785 to perform one or more other indicated operations as appropriate. Such other operations may include, for example, suspending or terminating execution of currently executing programs, and otherwise managing administrative aspects of the program execution service (registration of new users, determining and obtaining of payment for use of the program execution service, etc.). After blocks 745, 750, or 785, the routine continues to block 730 to optionally perform one or more housekeeping tasks. The routine then continues to block 795 to determine whether to continue, such as until an explicit termination instruction is received. If so, the routine returns to block 705, and if not continues to block 799 and ends.

While not illustrated here, in at least some embodiments, a variety of additional types of functionality to execute programs may be provided by a program execution service, such as in conjunction with a block data storage service. In at least some embodiments, the execution of one or more copies or instances of a program on one or more computing systems may be initiated in response to a current execution request for immediate execution of those program instances. Alternatively, the initiation may be based on a previously received program execution request that scheduled or otherwise reserved the then-future execution of those program instances for the now-current time. Program execution requests may be received in various ways, such as directly from a user (e.g., via an interactive console or other GUI provided by the program execution service), or from an executing program of a user that automatically initiates the execution of one or more instances of other programs or of itself (e.g., via an API provided by the program execution service, such as an API that uses Web services). Program execution requests may include various information to be used in the initiation of the execution of one or more instances of a program, such as an indication of a program that was previously registered or otherwise supplied for future execution, and a number of instances of the program that are to be executed simultaneously (e.g., expressed as a single desired number of instances, as a minimum and maximum number of desired instances, etc.). In addition, in some embodiments, program execution requests may include various other types of information, such as the following: an indication of a user account or other indication of a previously registered user (e.g., for use in identifying a previously stored program and/or in determining whether the requested program instance execution is authorized); an indication of a payment source for use in providing payment to the program execution service for the program instance execution; an indication of a prior payment or other authorization for the program instance execution (e.g., a previously purchased subscription valid for an amount of time, for a number of program execution instances, for an amount of resource utilization, etc.); and/or an executable or other copy of a program to be executed immediately and/or stored for later execution. In addition, in some embodiments, program execution requests may further include a variety of other types of preferences and/or requirements for execution of one or more program instances. Such preferences and/or requirements may include indications that some or all of the program instances be executed in an indicated geographical and/or logical location, such as in one of multiple data centers that house multiple computing systems available for use, on multiple computing systems that are proximate to each other, and/or on one or more computing systems that are proximate to computing systems having other indicated characteristics (e.g., that provide a copy of an indicated block data storage volume).
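
As a rough illustration of the request contents described above, a program execution request might be represented as follows; every field name here is hypothetical:

    # Hypothetical shape of a program execution request payload.
    execution_request = {
        "program_id": "prog-7",             # previously registered program
        "instances": {"min": 2, "max": 5},  # simultaneous instances desired
        "user_account": "user-42",          # for authorization and lookup
        "payment_source": "subscription-9", # prior payment or authorization
        "placement": {                      # preferences and/or requirements
            "data_center": "dc-east",
            "near_volume": "vol-1",         # proximity to a volume copy
        },
    }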

FIG. 8 is a flow diagram of an example embodiment of an Archival Manager routine 800. The routine may be provided by, for example, execution of one of the Archival Manager modules 355 of FIG. 3, of the Archival Manager module 190 of FIGS. 2C-2F, and/or of one or more archival manager modules (not shown) on the computing systems 180 of FIG. 1. In other embodiments, some or all of the functionality of the routine 800 may instead be provided in other manners, such as by routine 400 as part of the block data storage service. In the illustrated embodiment, the archival storage systems store data in chunks that each correspond to a portion of a block data storage volume, but in other embodiments data may be stored in other manners.

The illustrated embodiment of the routine 800 begins in block 805, where information or a request is received. The routine then continues to block 810 to determine if the request or information is authorized, such as if the requester has provided payment for fee-based access, or otherwise has access rights to have an indicated request be performed. If it is determined in block 815 that the request or information is authorized, the routine continues to block 820, and otherwise returns to block 805. In block 820, the routine determines if the received request is to store a new snapshot copy for an indicated volume. If so, the routine continues to block 825 to obtain multiple volume chunks for the volume, store each chunk as an archival storage system data object, and then store information about the data objects for the chunks that are associated with the snapshot volume copy. As discussed in greater detail elsewhere, the chunks of the volume may be obtained in various ways, such as by being received in block 805 as multiple distinct blocks, received in block 805 as a single large group of block data that is separated into chunks in block 825, retrieved in block 825 as individual chunks or as a single large group of block data to be separated into chunks, previously stored on the archival storage systems, etc.
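
Block 825's chunk-and-store step can be sketched as follows. Content-addressing the chunks by hash is an assumption made for brevity, not a detail of the described embodiment:

    # Minimal sketch of block 825: split the volume into fixed-size
    # chunks, store each chunk as an archival data object, and record
    # the object ids making up the snapshot; names are assumptions.
    import hashlib

    def store_snapshot(volume_data, chunk_size, object_store, snapshots, name):
        object_ids = []
        for offset in range(0, len(volume_data), chunk_size):
            chunk = bytes(volume_data[offset:offset + chunk_size])
            object_id = hashlib.sha256(chunk).hexdigest()
            object_store[object_id] = chunk  # one data object per chunk
            object_ids.append(object_id)
        snapshots[name] = object_ids         # chunk list for this snapshot
        return object_ids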

If it is instead determined in block 820 that the received request is not to store a new snapshot volume copy, the routine continues instead to block 830 to determine whether the received request is to store an incremental snapshot copy of a volume that reflects changes from a prior snapshot volume copy. If so, the routine continues to block 835 to identify the snapshot chunks that have changed since the prior snapshot copy of the volume, and to obtain copies of the changed snapshot chunks in a manner similar to that previously discussed with respect to block 825. The routine then continues to block 840 to store copies of the changed chunks, and to store information about the new changed chunks and the other unchanged chunks from the prior snapshot whose corresponding data objects are associated with the new snapshot volume copy. The chunks that have changed since a prior snapshot volume copy may be identified in various ways, such as by the server block data storage systems that store primary and/or mirror copies of the volume (e.g., by tracking any write data access requests or other modification requests for the volume).
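
The incremental path of blocks 835/840 then stores only the chunks that differ from the prior snapshot and reuses the rest. Continuing the previous sketch (and again assuming hash comparison as the change test, whereas the text notes changes may instead be tracked by the server storage systems):

    # Sketch of blocks 835/840: store only changed chunks and reference
    # the prior snapshot's unchanged data objects; names are assumptions.
    import hashlib

    def store_incremental_snapshot(volume_data, chunk_size, object_store,
                                   snapshots, prior_name, name):
        prior = snapshots[prior_name]
        object_ids = []
        for index, offset in enumerate(range(0, len(volume_data), chunk_size)):
            chunk = bytes(volume_data[offset:offset + chunk_size])
            object_id = hashlib.sha256(chunk).hexdigest()
            if index < len(prior) and prior[index] == object_id:
                object_ids.append(prior[index])   # unchanged: reuse object
            else:
                object_store[object_id] = chunk   # changed: store new object
                object_ids.append(object_id)
        snapshots[name] = object_ids
        return object_ids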

If it is instead determined in block 830 that the received request is not to store an incremental snapshot volume copy, the routine continues instead to block 845 to determine whether the request is to provide one or more chunks of a snapshot volume copy, such as from corresponding stored data objects. If so, the routine continues to block 850 to retrieve the data for the indicated snapshot volume copy chunk(s), and sends the retrieved data to the requester. Such requests may be, for example, part of creating a new volume based on an existing snapshot volume copy by retrieving all of the chunks for the snapshot volume copy, part of retrieving a subset of a snapshot volume copy's chunks to restore a minimal mirror volume copy, etc.

If it is instead determined in block 845 that the received request is not to provide one or more snapshot volume copy chunks, the routine continues to block 855 to determine if the received request is to perform one or more data access requests for one or more volume chunks that are not part of a snapshot volume copy, such as to perform read data access requests and/or write data access requests for one or more data objects that represent particular volume chunks (e.g., if those stored data objects serve as a backing store for those volume chunks). If so, the routine continues to block 860 to perform the requested data access request(s) for the stored data object(s) corresponding to the indicated volume chunk(s). As discussed in greater detail elsewhere, in at least some embodiments, lazy updating techniques may be used when modifying stored data objects, such that a write data access request may not be immediately performed. If so, before a later read data access request for the same data object is completed, the one or more preceding write data access requests may be performed to ensure strict data consistency.
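
The flush-before-read discipline described here is easy to express as a small buffered store. The sketch below is illustrative; the class and its internal layout are assumptions:

    # Illustrative sketch of the lazy-update behavior in block 860:
    # writes are buffered per data object, and any pending writes for an
    # object are applied before a read of it completes, preserving strict
    # data consistency; names are assumptions.
    class LazyChunkStore:
        def __init__(self):
            self.objects = {}   # object id -> stored bytes
            self.pending = {}   # object id -> list of buffered writes

        def write(self, object_id, offset, data):
            # A write data access request is not performed immediately.
            self.pending.setdefault(object_id, []).append((offset, data))

        def read(self, object_id):
            # Apply preceding writes for this object before serving the read.
            buf = bytearray(self.objects.get(object_id, b""))
            for offset, data in self.pending.pop(object_id, []):
                if len(buf) < offset + len(data):
                    buf.extend(b"\x00" * (offset + len(data) - len(buf)))
                buf[offset:offset + len(data)] = data
            self.objects[object_id] = bytes(buf)
            return self.objects[object_id]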

If it is instead determined in block 855 that the received request is not to perform data access requests for one or more volume chunks, the routine continues instead to block 885 to perform one or more other indicated operations as appropriate. Such other operations may include, for example, repeatedly receiving information that corresponds to modifications being performed on a volume in order to update corresponding stored data objects that represent the volume (e.g., as a backing store or for other purposes) and taking appropriate corresponding actions, responding to requests to delete or otherwise modify stored snapshot volume copies, responding to requests of a user to manage an account with a storage service that provides the archival storage systems, etc. After blocks 825, 840, 850, 860, or 885, the routine continues to block 895 to determine whether to continue, such as until an explicit termination instruction is received. If so, the routine returns to block 805, and if not continues to block 899 and ends.

As noted above, for at least some types of requests, the routine may in some embodiments verify that the requester is authorized to make the request, such as based on access rights specified for the requester and/or an associated target of the request (e.g., an indicated volume or snapshot volume copy), while in other embodiments the routine may assume that requests have been previously authorized by a routine from which it receives requests (e.g., a Node Manager routine and/or a BDS System Manager routine). Furthermore, some or all types of actions taken on behalf of users may be monitored and metered in at least some embodiments, such as for later use in determining corresponding usage-based fees for at least some of those actions.

Additional details related to the operation of example embodiments of a program execution service with which the described techniques may be used are available in U.S. patent application Ser. No. 11/395,463, filed Mar. 31, 2006 and entitled “Managing Execution Of Programs By Multiple Computing Systems,” now U.S. Pat. No. 8,190,682; in U.S. patent application Ser. No. 11/851,345, filed Sep. 6, 2007 and entitled “Executing Programs Based on User-Specified Constraints,” now U.S. Pat. No. 7,792,944, which is a continuation-in-part of U.S. patent application Ser. No. 11/395,463, now U.S. Pat. No. 8,190,682; and U.S. application Ser. No. 12/145,411, filed Jun. 24, 2008 and entitled “Managing Communications Between Computing Nodes,” now U.S. Pat. No. 9,369,302; each of which is incorporated herein by reference in its entirety. In addition, additional details related to the operation of one example of a remote storage service that may be used to store snapshot volume copies or otherwise provide remote archival storage systems are available in U.S. Patent Application Publication No. 2007/0156842, published Jul. 5, 2007 and entitled “Distributed Storage System With Web Services Client Interface,” now U.S. Pat. No. 7,716,180, which is incorporated herein by reference in its entirety, and which claims priority of U.S. Patent Application No. 60/754,726, filed Dec. 29, 2005. Furthermore, additional details related to one example of users providing paid access to the users' programs or other data for other users are available in U.S. patent application Ser. No. 11/963,331, filed Dec. 21, 2007 and entitled “Providing Configurable Pricing for Execution of Software Images,” now U.S. Pat. No. 8,788,379, which is incorporated herein by reference in its entirety, and which may similarly be used herein for users to charge other users for various types of paid access to volumes and/or snapshot copies, as discussed in greater detail elsewhere.

In addition, as previously noted, some embodiments may employ virtual machines, and if so the programs to be executed by the program execution service may include entire virtual machine images. In such embodiments, a program to be executed may comprise an entire operating system, a file system and/or other data, and possibly one or more user-level processes. In other embodiments, a program to be executed may comprise one or more other types of executables that interoperate to provide some functionality. In still other embodiments, a program to be executed may comprise a physical or logical collection of instructions and data that may be executed natively on the provided computing system or indirectly by means of interpreters or other software-implemented hardware abstractions. More generally, in some embodiments, a program to be executed may include one or more application programs, application frameworks, libraries, archives, class files, scripts, configuration files, data files, etc.

In addition, as previously noted, in at least some embodiments and situations, volumes may be migrated or otherwise moved from one server storage system to another. Various techniques may be used to move volumes, and such movement may be initiated in various manners. In some situations, the movement may reflect problems related to the server storage systems on which the volumes are stored (e.g., failure of the server storage systems and/or of network access to the server storage systems). In other situations, the movement may be performed to accommodate other volume copies to be stored on existing server storage systems, such as for higher-priority volumes, or to consolidate the storage of volume copies on a limited number of server storage systems, such as to enable the original server storage systems that store the volume copies to be shut down for reasons such as maintenance, energy conservation, etc. As one specific example, if the one or more volume copies stored on a server storage system need more resources than are available from that server storage system, one or more of the volume copies may be migrated to one or more other server storage systems with additional resources. Overuse of available resources may occur for various reasons, such as one or more server storage systems having fewer resources than expected, one or more of the server storage systems using more resources than expected (or allowed), or the available resources of one or more server storage systems being intentionally over-committed relative to the possible resource needs of one or more reserved or stored volume copies. For example, even if the expected resource needs of the volume copies are within the available resources, the maximum resource needs may exceed the available resources. Overuse of available resources may also occur if the actual resources needed for volume storage or use exceed the available resources.

It will be appreciated that in some embodiments the functionality provided by the routines discussed above may be provided in alternative ways, such as being split among more routines or consolidated into fewer routines. Similarly, in some embodiments, illustrated routines may provide more or less functionality than is described, such as when other illustrated routines instead lack or include such functionality respectively, or when the amount of functionality that is provided is altered. In addition, while various operations may be illustrated as being performed in a particular manner (e.g., in serial or in parallel) and/or in a particular order, in other embodiments the operations may be performed in other orders and in other manners. Similarly, the data structures discussed above may be structured in different manners in other embodiments, such as by having a single data structure split into multiple data structures or by having multiple data structures consolidated into a single data structure, and may store more or less information than is described (e.g., when other illustrated data structures instead lack or include such information respectively, or when the amount or types of information that is stored is altered).

From the foregoing it will be appreciated that, although specific embodiments have been described herein for purposes of illustration, various modifications may be made without deviating from the spirit and scope of the invention. Accordingly, the invention is not limited except as by the appended claims and the elements recited therein. In addition, while certain aspects of the invention are presented below in certain claim forms, the inventors contemplate the various aspects of the invention in any available claim form. For example, while only some aspects of the invention may currently be recited as being embodied in a computer-readable medium, other aspects may likewise be so embodied.

What is claimed is:
1. A computer-implemented method, comprising: providing, by one or more computing systems of a network-accessible service, access over one or more computer networks from a first computing instance to a data storage volume on a storage system, wherein the first computing instance is a virtual machine hosted on a first host computing system of the network-accessible service, and the data storage volume is attached over a network to the first computing instance as a logical local storage device of the first computing instance; initiating performance of one or more first data access requests by the first computing instance to the data storage volume on the storage system; responsive to a determination that execution of a program on the first computing instance is to be switched to a different computing instance: selecting a second computing instance to switch the execution of the program, wherein the second computing instance is another virtual machine hosted on a second host computing system of the network-accessible service; attaching, by a node manager executing on the second host computing system and managing the second computing instance, the data storage volume to the second computing instance as a logical local storage device of the second computing instance; providing, by the one or more computing systems, access from the second computing instance to the data storage volume on the storage system; and switching execution of the program from the first computing instance to the second computing instance, and continuing the execution of the program on the second computing instance; and initiating performance of one or more second data access requests by the second computing instance to the data storage volume on the storage system.
2. The computer-implemented method of claim 1 further comprising moving or replicating, by the one or more computing systems and before the switching, the data storage volume from a first data center associated with the first computing instance to a second data center associated with the second computing instance.
3. The computer-implemented method of claim 1 wherein the determination that the execution of the program is to be switched to a different computing instance comprises identifying, by the one or more computing systems, one or more problems on the first computing instance.
4. The computer-implemented method of claim 1 wherein the determination that the execution of the program is to be switched to a different computing instance comprises determining, by the one or more computing systems, to conserve energy by shutting down at least some operations of the one or more computing systems.

5. The computer-implemented method of claim 1 wherein the switching includes, by the one or more computing systems, stopping the execution of the program on the first computing instance, and starting the execution of the program on the second computing instance.
6. The computer-implemented method of claim 1 wherein the execution of the program on the first computing instance includes executing a first copy of the program on the first computing instance, and wherein the switching includes performing, by the one or more computing systems and while the execution of the first copy of the program on the first computing instance continues, a redirection of communication for the program from the first computing instance to the second computing instance, including executing a second copy of the program on the second computing instance concurrently with the execution of the first copy of the program on the first computing instance.
7. The computer-implemented method of claim 6 wherein the performing of the switch includes transferring execution state information from the executing first copy of the program to the executing second copy of the program.
8. The computer-implemented method of claim 6 wherein the network-accessible service is a program execution service that executes programs for multiple users using a plurality of computing systems provided by the program execution service, and wherein the method further comprises selecting a second computing system from the plurality of computing systems to provide the second computing instance, and performing the execution of the second copy of the program on the second computing instance for a user of the program execution service.
9. The computer-implemented method of claim 1 wherein the network-accessible service is a program execution service that executes programs for multiple users using a plurality of computing systems provided by the program execution service, and wherein the method further comprises selecting, before the providing of the access from the first computing instance, a first computing system from the plurality of computing systems to provide the first computing instance, and executing the program on the first computing instance for a user of the program execution service.
10. The computer-implemented method of claim 9 wherein the node manager executing on the second host computing system is a hypervisor of the second host computing system configured to monitor a plurality of virtual machines on the second host computing system.
11. The computer-implemented method of claim 9 wherein the data storage volume is provided by a data storage service that is separate from the program execution service and that operates the storage system, and wherein the data storage service provides an application programming interface (API) used by the program execution service for access to the data storage volume.
12. The computer-implemented method of claim 1 wherein the providing of the access from the first computing instance includes attaching the data storage volume to the first computing instance and providing a logical representation of the data storage volume for the first computing instance.
13. A non-transitory computer-readable medium with stored contents including software instructions that when executed by one or more computing systems of a virtual compute service cause the one or more computing systems to: execute, for a user of the virtual compute service, a program on a first computing instance provided by the virtual compute service, wherein the first computing instance is a virtual machine hosted on a first host computing system of the virtual compute service; provide access over one or more computer networks from the first computing instance to a data storage volume on a storage system, wherein the data storage volume is attached over a network to the first computing instance as a logical local storage device of the first computing instance; initiate performance of one or more first data access requests by the first computing instance to the data storage volume on the storage system; responsive to a determination that execution of the program on the first computing instance is to be switched to a different computing instance: select a second computing instance to switch the execution of the program, wherein the second computing instance is another virtual machine hosted on a second host computing system of the virtual compute service; attach, by a node manager executing on the second host computing system and managing the second computing instance, the data storage volume to the second computing instance as a logical local storage device of the second computing instance; provide access over the one or more computer networks from the second computing instance to the data storage volume on the storage system; and switch the execution of the program from the first computing instance to the second computing instance, and continue the execution of the program on the second computing instance; and initiate performance of one or more second data access requests by the second computing instance to the data storage volume on the storage system.
14. The non-transitory computer-readable medium of claim 13 wherein to determine that execution of the program is to be switched to a different computing instance, the software instructions when executed by the one or more computing systems further cause the one or more computing systems to identify one or more problems on a first host computing system providing the first computing instance.

15. The non-transitory computer-readable medium of claim 13 wherein to determine that execution of the program is to be switched to a different computing instance, the software instructions when executed by the one or more computing systems further cause the one or more computing systems to determine, before the switching of the execution of the program to the second computing instance, to conserve energy by shutting down the first host computing system.
16. The non-transitory computer-readable medium of claim 13 wherein to switch the execution of the program, the software instructions when executed by the one or more computing systems further cause the one or more computing systems to stop the execution of the program on the first computing instance, and after the stopping, start the execution of the program on the second computing instance.
17. The non-transitory computer-readable medium of claim 13 wherein the execution of the program on the first computing instance includes executing a first copy of the program on the first computing instance, and wherein to switch the execution of the program to the second computing instance, the software instructions when executed by the one or more computing systems further cause the one or more computing systems to: perform, while the execution of the first copy of the program on the first computing instance continues, a redirection of communication for the program from the first computing instance to the second computing instance; execute a second copy of the program on the second computing instance concurrently with the execution of the first copy of the program on the first computing instance; and transfer execution state information from the executing first copy of the program to the executing second copy of the program.
18. The non-transitory computer-readable medium of claim 13 wherein the data storage volume stores block data, and wherein the one or more first data access requests and the one or more second data access requests include one or more requests to store and access block data on the data storage volume.
19. A system, comprising: one or more hardware processors; and a memory having stored instructions that, when executed on the one or more hardware processors, cause the system to implement at least a portion of an online service and cause the online service to: provide access from a first computing instance to a data storage disk provided on a storage system, wherein the first computing instance is a virtual machine hosted on a first host computing system of the online service, and the data storage disk is attached to the first computing instance over a network as a logical local storage device of the first computing instance; initiate performance of one or more first data access requests by the first computing instance to the data storage disk on the storage system; responsive to a determination that execution of a program on the first computing instance is to be switched to a different computing instance: select a second computing instance to switch the execution of the program, wherein the second computing instance is another virtual machine hosted on a second host computing system of the online service; attach, by a node manager executing on the second host computing system and managing the second computing instance, the data storage disk to the second computing instance as a logical local storage device of the second computing instance; provide access from the second computing instance to the data storage disk on the storage system; and switch execution of the program from the first computing instance to the second computing instance, and continue the execution of the program on the second computing instance; and initiate performance of one or more second data access requests by the second computing instance to the data storage disk on the storage system.
20. The system of claim 19 wherein the online service is a program execution service executing programs for multiple users using a plurality of computing systems provided by the program execution service that include a first computing system providing the first computing instance and a second computing system providing the second computing instance, and wherein the stored instructions further cause the system to move or replicate the data storage disk from a first data center associated with the first computing system to a second data center associated with the second computing system.
21. The system of claim 19 wherein the online service includes a data storage service, wherein the data storage service is configured to provide data storage volumes for multiple users using a plurality of storage systems, and wherein the stored instructions further cause the system to: receive the one or more first data access requests and the one or more second data access requests over one or more computer networks from the program executing on the first and second computing instances; and perform the one or more first data access requests and the one or more second data access requests on the storage system, wherein the one or more first data access requests and the one or more second data access requests include one or more requests to store data and include one or more requests to access previously stored data.